Bug 1400132 - Node reports as Ready, but it takes about 15 mins to schedule a build
Summary: Node reports as Ready, but it takes about 15 mins to schedule a build
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.z
Assignee: Cesar Wong
QA Contact: Mike Fiedler
URL:
Whiteboard:
Duplicates: 1418906
Depends On:
Blocks:
 
Reported: 2016-11-30 13:56 UTC by Vadim Rutkovsky
Modified: 2021-08-30 12:10 UTC
CC List: 19 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The builds controller iterates through all builds in the system and processes completed builds to see whether other builds need to be started. It keeps iterating through completed builds regardless of when each build completed.
Consequence: Scheduling a new build can take a long time when there is a large number of completed builds (> 1000) for the controller to process.
Fix: Process a build only once, on completion, to decide whether other builds should be started, and ignore completed builds in the regular build controller loop.
Result: Builds start quickly regardless of how many completed builds exist in the system. (A sketch illustrating the fix appears after this field list.)
Clone Of:
Environment:
Last Closed: 2017-01-31 20:19:08 UTC
Target Upstream Version:
Embargoed:
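
As an illustration of the fix described in Doc Text above, here is a minimal Go sketch of "process a completed build once, then ignore it". The type, annotation, and function names are hypothetical, not the actual Origin identifiers:

package main

import "fmt"

// Build is a trimmed stand-in for the real build object.
type Build struct {
	Name        string
	Phase       string            // "New", "Pending", "Running", "Complete", ...
	Annotations map[string]string // remembers that completion was handled
}

// handledAnnotation marks builds whose completion has already been
// processed; the name is illustrative only.
const handledAnnotation = "build.example/on-complete-handled"

// handleBuild mirrors the fixed behavior: a completed build triggers the
// start-policy check exactly once and is then ignored, so the controller
// loop no longer re-processes every historical completed build.
func handleBuild(b *Build, startNextQueuedBuild func()) {
	if b.Phase != "Complete" {
		return // only completion triggers the policy check
	}
	if b.Annotations[handledAnnotation] == "true" {
		return // already handled once; skip on later sweeps
	}
	startNextQueuedBuild() // e.g. SerialPolicy.OnComplete starting a queued build
	b.Annotations[handledAnnotation] = "true"
}

func main() {
	b := &Build{Name: "example-1", Phase: "Complete", Annotations: map[string]string{}}
	for i := 0; i < 3; i++ { // three controller sweeps over the same build
		handleBuild(b, func() { fmt.Println("starting next queued build") })
	}
	// Prints once, not three times: later sweeps skip the handled build.
}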


Attachments (Terms of Use)
Logs from 10.3.8.111 (451.74 KB, application/zip)
2017-01-10 14:11 UTC, Vadim Rutkovsky
CPU Profiling (149.41 KB, application/x-xz)
2017-01-13 20:00 UTC, Vadim Rutkovsky
Stacktrace (3.31 MB, text/plain)
2017-01-16 22:06 UTC, Vadim Rutkovsky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1418906 0 medium CLOSED More than 1000+ complete builds makes new build start slow with New status or hang 2021-08-30 12:22:34 UTC
Red Hat Knowledge Base (Solution) 2902871 0 None None None 2017-02-07 14:52:43 UTC
Red Hat Product Errata RHBA-2017:0218 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.4.1.2 bug fix update 2017-02-01 01:18:20 UTC

Internal Links: 1418906

Description Vadim Rutkovsky 2016-11-30 13:56:10 UTC
Description of problem:

The OSBS team has upgraded rcm-img-docker02.build.eng.bos.redhat.com to OCP 3.3.1, and now it takes about 15 minutes for a build to move from the New status to Pending and beyond

Version-Release number of selected component (if applicable):
oc v3.1.1.6-33-g81eabcc
kubernetes v1.1.0-origin-1107-g4c8e6f4

A server with both master and node services running

How reproducible:
Always

Steps to Reproduce:
1. Start a build via brew:

brew container-build extras-rhel-7.2-docker-candidate --repo-url="http://file.brq.redhat.com/~vrutkovs/buildroot.repo" git://pkgs.devel.redhat.com/rpms/osbs-buildroot-docker#f24f4e47dcb71fcbb9cd6bdab6b756265d8cf4fb --git-branch=private-vrutkovs-all-from-git --scratch --channel-override="testing"

2. Wait until a new build - 'scratch-<timestamp>' in 'default' namespace - appears on that machine

Actual results:
The build is stuck in the New state and doesn't move to Pending or beyond

Expected results:
The build moves to Pending and proceeds

Additional info:
Restarting the node seems to fix it: after ~15 minutes the queued builds are scheduled correctly.

The symptoms are similar to bug #1374407; however, the build gets stuck in the New state, not Pending.

Comment 2 Avesh Agarwal 2016-12-01 02:12:04 UTC
OK, I will look into it today.

Comment 3 Avesh Agarwal 2016-12-01 10:42:56 UTC
I tried it with 3.3.1 but could not reproduce it.

Could you provide master and node logs (with log level at least 5) from when the issue happened, and any more information about your environment?

Comment 5 Avesh Agarwal 2016-12-01 13:07:10 UTC
Vadim, I can select weekly or daily logs in Kibana via the links you sent above, but it only shows the first 500 events. Do you know how to save complete daily or weekly logs locally?

Comment 6 Luiz Carvalho 2016-12-01 16:50:44 UTC
We noticed delays longer than 15 minutes, in some cases over 6 hours.
It looks like the culprit is that this cluster had too many builds: 2937.
Once we deleted them, the issue went away.

This is an all-in-one OSE cluster (v3.3.1.5).

Comment 8 Avesh Agarwal 2017-01-06 22:04:21 UTC
Luiz,

Could you attach logs from the time when the problem occurred, or give me access to the cluster to look at? I will look into this first thing on Monday.

Comment 19 Vadim Rutkovsky 2017-01-10 14:11:28 UTC
Created attachment 1239118 [details]
Logs from 10.3.8.111

Extracted part of the logs from the 10.3.8.111 machine for yesterday: osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40 but was not scheduled until 15:40.

Comment 20 Avesh Agarwal 2017-01-10 14:19:37 UTC
(In reply to Vadim Rutkovsky from comment #19)
> Created attachment 1239118 [details]
> Logs from 10.3.8.111
> 
> Extracted part of the logs from the 10.3.8.111 machine for yesterday:
> osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40
> but was not scheduled until 15:40.

I think it's because I had restarted the master around 15:40 to increase the log level to 5 (see my previous comments), and then the build was scheduled. So these logs seem old, and at log level 2 they don't provide much help. I am waiting for the problem to happen again so that I can investigate with the increased log level.

Comment 24 Vadim Rutkovsky 2017-01-13 20:00:20 UTC
Created attachment 1240503 [details]
CPU Profiling

Added CPU profiling data for the cluster while no builds are running (but top still shows 140%+ CPU usage).

Comment 25 Vadim Rutkovsky 2017-01-16 22:06:47 UTC
Created attachment 1241431 [details]
Stacktrace

Generated a stacktrace on the cluster using
'curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2'
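
(For reference, that URL is Go's standard net/http/pprof endpoint; a generic sketch of how a Go process exposes it, not Origin's actual wiring:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serving on localhost:6060 matches the curl command above;
	// /debug/pprof/goroutine?debug=2 then dumps every goroutine's stack.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}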

It seems goroutine 4143 is stuck while applying a policy to a build:

...
github.com/openshift/origin/pkg/client.(*builds).List(0xc8415abda0, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/client/builds.go:48 +0x19f
github.com/openshift/origin/pkg/build/client.OSClientBuildClient.List(0x7fa412ddb838, 0xc822cc1fe8, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/client/clients.go:68 +0xab
github.com/openshift/origin/pkg/build/client.(*OSClientBuildClient).List(0xc82a519f30, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, ...)
	<autogenerated>:4 +0xe7
github.com/openshift/origin/pkg/build/util.BuildConfigBuilds(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0xc82c1b92f0, 0xc849e786d8, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/util/util.go:124 +0xd6
github.com/openshift/origin/pkg/build/controller/policy.GetNextConfigBuild(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0x0, 0x0, 0x0, 0xc8582b1500, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:96 +0xea
github.com/openshift/origin/pkg/build/controller/policy.handleComplete(0x7fa409ede2d0, 0xc82a519f30, 0x7fa409ede2a8, 0xc82a519f20, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:135 +0xf2
github.com/openshift/origin/pkg/build/controller/policy.(*SerialPolicy).OnComplete(0xc8292ba260, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/serial.go:34 +0x5a
github.com/openshift/origin/pkg/build/controller.(*BuildController).HandleBuild(0xc8292b8800, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/controller.go:99 +0x694
github.com/openshift/origin/pkg/build/controller/factory.(*BuildControllerFactory).Create.func1(0x4a01ea0, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/factory/factory.go:106 +0x8b
...
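
Reading the trace bottom-up: HandleBuild invokes SerialPolicy.OnComplete for the completed build, which lists all builds of its config (GetNextConfigBuild -> BuildConfigBuilds -> builds.List). A rough Go sketch of why this gets slow with thousands of completed builds (type and function names hypothetical):

package main

import "fmt"

// Build is a minimal stand-in; field names are illustrative.
type Build struct {
	Config string
	Phase  string
}

// sweep mirrors the pre-fix behavior implied by the trace: every controller
// sweep visits every build, and every completed build triggers a fresh List
// of its config's builds. With N completed builds that is N List calls per
// sweep, each scanning up to N builds, so the work grows roughly as O(N^2).
func sweep(all []Build, list func(cfg string) []Build) (listCalls int) {
	for _, b := range all {
		if b.Phase == "Complete" {
			_ = list(b.Config) // one full List per completed build
			listCalls++
		}
	}
	return listCalls
}

func main() {
	builds := make([]Build, 3000) // roughly the count reported in comment 6
	for i := range builds {
		builds[i] = Build{Config: "example-config", Phase: "Complete"}
	}
	fmt.Println("List calls per sweep:", sweep(builds, func(string) []Build { return builds }))
}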

Comment 26 Vadim Rutkovsky 2017-01-17 17:05:37 UTC
I think I found a workaround for this:

$ cat go_template_file
{{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
$ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"openshift.io/build.start-policy": "Serial"}}}'
$ sudo systemctl restart atomic-openshift-master

The issue seems to be caused by builds created in OSE 3.2. Those didn't have the 'openshift.io/build.start-policy' label set, so atomic-openshift-master logged a lot of warnings and tried to set a policy for each of them.

This wasted a lot of time, and the changes were never written back to the build object. Since the build scheduler is limited to 30 minutes, with a sufficient number of builds the process eventually gets stuck.

Comment 27 Avesh Agarwal 2017-01-17 17:24:36 UTC
Hi Vadim,

Thanks, I am going through the information you provided and will get back to you.

Comment 28 Avesh Agarwal 2017-01-19 18:53:24 UTC
(In reply to Vadim Rutkovsky from comment #26)
> I think I found a workaround for this:
> 
> $ cat go_template_file
> {{.metadata.name}} - {{index .metadata.labels
> "openshift.io/build.start-policy"}}
> $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o
> go-template-file=go_template_file | grep "<no value>" > builds
> $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata":
> {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> $ sudo systemctl restart atomic-openshift-master
> 
> The issue seems to be caused by builds created in OSE 3.2. Those didn't
> have the 'openshift.io/build.start-policy' label set, so
> atomic-openshift-master logged a lot of warnings and tried to set a policy
> for each of them.
> 
> This wasted a lot of time, and the changes were never written back to the
> build object. Since the build scheduler is limited to 30 minutes, with a
> sufficient number of builds the process eventually gets stuck.

So are you able to reproduce the issue after this workaround?

Comment 30 Vadim Rutkovsky 2017-01-19 19:52:14 UTC
(In reply to Avesh Agarwal from comment #28)
> (In reply to Vadim Rutkovsky from comment #26)
> > I think I found a workaround for this:
> > 
> > $ cat go_template_file
> > {{.metadata.name}} - {{index .metadata.labels
> > "openshift.io/build.start-policy"}}
> > $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o
> > go-template-file=go_template_file | grep "<no value>" > builds
> > $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata":
> > {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> > $ sudo systemctl restart atomic-openshift-master
> > 
> > The issue seems to be caused by builds created in OSE 3.2. Those didn't
> > have the 'openshift.io/build.start-policy' label set, so
> > atomic-openshift-master logged a lot of warnings and tried to set a policy
> > for each of them.
> > 
> > This wasted a lot of time, and the changes were never written back to the
> > build object. Since the build scheduler is limited to 30 minutes, with a
> > sufficient number of builds the process eventually gets stuck.
> 
> So are you able to reproduce the issue after this workaround?

Yes, this worked for our dev cluster: after the master restart the builds were scheduled again and haven't gotten stuck for several days now.

Comment 42 Wenjing Zheng 2017-01-24 09:12:09 UTC
Could anyone provide more info on how QE can verify this bug?

Comment 43 Vadim Rutkovsky 2017-01-24 11:25:03 UTC
(In reply to Wenjing Zheng from comment #42)
> Could anyone provide more info on how QE can verify this bug?

I think the only way to reproduce this is to create a lot of builds (~3000) and check their state: builds shouldn't hang in the New state.
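
As a sketch, a loop like the following can generate that many completed builds. The BuildConfig name "test-bc" is hypothetical; it assumes oc is on PATH and logged in to the cluster:

package main

import (
	"log"
	"os/exec"
)

func main() {
	// Start ~3000 builds of a hypothetical BuildConfig and wait for each
	// to complete, so the cluster accumulates completed builds as in
	// comment 6; then check that a fresh build still leaves New promptly.
	for i := 0; i < 3000; i++ {
		out, err := exec.Command("oc", "start-build", "test-bc", "--wait").CombinedOutput()
		if err != nil {
			log.Printf("build %d failed: %v (%s)", i, err, out)
		}
	}
}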

Comment 50 Mike Fiedler 2017-01-25 18:46:15 UTC
Verified on 3.4.1.2.

Reproduced by creating 3K Completed builds on 3.4.0.40 and then starting a new build with oc start-build. It took > 20 min for the build to actually run.

Updated to 3.4.1.2 and:
   a) started a new build on top of the existing builds; it started immediately.
   b) deleted the project and retested with a project with 5K Completed builds; new builds started immediately.

Comment 53 errata-xmlrpc 2017-01-31 20:19:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

Comment 54 Ryan Howe 2017-02-07 14:55:40 UTC
*** Bug 1418906 has been marked as a duplicate of this bug. ***

