Bug 1400132

| Summary: | Node reports as Ready, but it takes about 15 mins to schedule a build |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Build |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | unspecified |
| Version: | 3.3.1 |
| Target Milestone: | --- |
| Target Release: | 3.4.z |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | |
| Reporter: | Vadim Rutkovsky <vrutkovs> |
| Assignee: | Cesar Wong <cewong> |
| QA Contact: | Mike Fiedler <mifiedle> |
| Docs Contact: | |
| CC: | acarter, adellape, bparees, cewong, haowang, jokerman, knakayam, lucarval, mgoldman, mifiedle, mmccomas, mmilata, mrobson, pweil, rhowe, sten, trankin, twaugh, wzheng |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2017-01-31 20:19:08 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Attachments: | |

Doc Text:

Cause: The build controller iterates through all builds in the system and processes completed builds to see whether other builds need to be started. It keeps reprocessing completed builds regardless of when they finished.

Consequence: Scheduling a new build can take a long time when there is a large number of completed builds (> 1000) for the controller to process.

Fix: A build is now processed only once, on completion, to decide whether other builds should be started; completed builds are ignored in the regular build controller loop.

Result: Builds start quickly regardless of how many completed builds exist in the system.
Description (Vadim Rutkovsky, 2016-11-30 13:56:10 UTC)
OK, will look into it today. I tried it with 3.3.1 but could not reproduce. Could you provide master and node logs (with log level at least 5) from when the issue happened, and any more information about your environment?

Vadim, I can select weekly or daily logs in Kibana via the links you sent above, but it only shows the first 500 events. Do you know how to save complete daily or weekly logs locally?

We noticed longer than 15 minutes, in some cases more than 6 hours. It looks like the culprit is that this cluster had too many builds (2937). Once we deleted them, the issue went away. This is an all-in-one OSE cluster (v3.3.1.5).

Luiz, could you attach logs from the time when the problem occurred, or give me access to the cluster to look at? I will look into this first thing on Monday.

Created attachment 1239118 [details]
Logs from 10.3.8.111
Extracted part of the logs from the 10.3.8.111 machine for yesterday: osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40 but was not scheduled until 15:40.
(In reply to Vadim Rutkovsky from comment #19)
> Created attachment 1239118 [details]
> Logs from 10.3.8.111
>
> Extracted part of the logs from the 10.3.8.111 machine for yesterday:
> osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40
> but was not scheduled until 15:40.

I think it's because I had restarted the master around 15:40 to increase the log level to 5 (see my previous comments), and then the build was scheduled. So these logs seem old and, at log level 2, do not provide much help. I am waiting for the problem to happen again so that I can investigate with the increased log level.

Created attachment 1240503 [details]
CPU Profiling
Added CPU profiling data for the cluster at a time when no builds are running (but top still shows 140%+ CPU usage).
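
For reference, a minimal sketch of how such profiles can be collected, assuming the master was started with the pprof web endpoint enabled; the OPENSHIFT_PROFILE=web setting, the sysconfig path, and the binary path below are assumptions about this particular setup, while the goroutine-dump command matches the one used in the next comment:

# Assumption: pprof is enabled on the master, e.g. OPENSHIFT_PROFILE=web set in
# /etc/sysconfig/atomic-openshift-master, exposing net/http/pprof on :6060.
$ curl -o cpu.prof 'http://localhost:6060/debug/pprof/profile?seconds=30'   # 30-second CPU profile
$ curl -o stack.txt 'http://localhost:6060/debug/pprof/goroutine?debug=2'   # full goroutine dump
$ go tool pprof /usr/bin/openshift cpu.prof   # binary path is an assumption; point it at the master binary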
Created attachment 1241431 [details]
Stacktrace

Generated a stacktrace on the cluster using 'curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2'. It seems goroutine 4143 is stuck while applying a policy to a build:

...
github.com/openshift/origin/pkg/client.(*builds).List(0xc8415abda0, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, 0x0, 0x0, ...)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/client/builds.go:48 +0x19f
github.com/openshift/origin/pkg/build/client.OSClientBuildClient.List(0x7fa412ddb838, 0xc822cc1fe8, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, ...)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/client/clients.go:68 +0xab
github.com/openshift/origin/pkg/build/client.(*OSClientBuildClient).List(0xc82a519f30, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, ...)
        <autogenerated>:4 +0xe7
github.com/openshift/origin/pkg/build/util.BuildConfigBuilds(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0xc82c1b92f0, 0xc849e786d8, 0x0, 0x0)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/util/util.go:124 +0xd6
github.com/openshift/origin/pkg/build/controller/policy.GetNextConfigBuild(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0x0, 0x0, 0x0, 0xc8582b1500, ...)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:96 +0xea
github.com/openshift/origin/pkg/build/controller/policy.handleComplete(0x7fa409ede2d0, 0xc82a519f30, 0x7fa409ede2a8, 0xc82a519f20, 0xc8315f33e0, 0x0, 0x0)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:135 +0xf2
github.com/openshift/origin/pkg/build/controller/policy.(*SerialPolicy).OnComplete(0xc8292ba260, 0xc8315f33e0, 0x0, 0x0)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/serial.go:34 +0x5a
github.com/openshift/origin/pkg/build/controller.(*BuildController).HandleBuild(0xc8292b8800, 0xc8315f33e0, 0x0, 0x0)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/controller.go:99 +0x694
github.com/openshift/origin/pkg/build/controller/factory.(*BuildControllerFactory).Create.func1(0x4a01ea0, 0xc8315f33e0, 0x0, 0x0)
        /builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/factory/factory.go:106 +0x8b
...

I think I found a workaround for this:
$ cat go_template_file
{{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
$ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
$ sudo systemctl restart atomic-openshift-master
The issue seems to be caused by builds created in OSE 3.2. Those didn't have the 'openshift.io/build.start-policy' label set, so atomic-openshift-master was logging a lot of warnings and tried to set a policy for each of them.
This wasted a lot of time, and the changes were not written back to the build objects. Since the build scheduler is limited to 30 minutes, with a sufficient number of builds the process eventually gets stuck.
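
As a quick check that the workaround took effect, the same template can be reused after the patch loop and the master restart; go_template_file here is simply the file created in the workaround above:

# Re-run the check from the workaround; after patching and restarting the master,
# no build should be missing the start-policy label any more.
$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep -c "<no value>"
# Expected output: 0 (grep also exits non-zero when nothing matches)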
Hi Vadim, thanks, I am going through the information you provided and will get back to you.

(In reply to Vadim Rutkovsky from comment #26)
> I think I found a workaround for this:
>
> $ cat go_template_file
> {{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
> $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
> $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> $ sudo systemctl restart atomic-openshift-master
>
> The issue seems to be caused by builds created in OSE 3.2. Those didn't have the 'openshift.io/build.start-policy' label set, so atomic-openshift-master was logging a lot of warnings and tried to set a policy for each of them.
>
> This wasted a lot of time, and the changes were not written back to the build objects. Since the build scheduler is limited to 30 minutes, with a sufficient number of builds the process eventually gets stuck.

So are you able to reproduce the issue after this workaround?

(In reply to Avesh Agarwal from comment #28)
> So are you able to reproduce the issue after this workaround?

Yes, this worked for our dev cluster: after the master restart the builds were scheduled again and have not gotten stuck for several days now.

Could anyone provide more info on how QE can verify this bug?

(In reply to Wenjing Zheng from comment #42)
> Could anyone provide more info on how QE can verify this bug?

I think the only way to reproduce this is to create a lot of builds (~3000) and check their state: builds shouldn't hang in New (see the reproduction sketch at the end of this report).

Verified on 3.4.1.2. Reproduced by creating 3K completed builds on 3.4.0.40 and then starting a new build with oc start-build; it took more than 20 minutes for the build to actually run. Updated to 3.4.1.2 and:
a) started a new build on top of the existing builds; it started immediately.
b) deleted the project and retested with a project containing 5K completed builds; new builds started immediately.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0218

*** Bug 1418906 has been marked as a duplicate of this bug. ***
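
For completeness, a minimal reproduction sketch along the lines QE describes above. It assumes a project that already contains a large backlog (~3000) of completed builds and a BuildConfig named "sample"; the name and count are assumptions, not taken from this bug:

# Assumes ~3000 completed builds already exist in the current project and a
# BuildConfig named "sample".
$ oc get builds -a --no-headers | grep -c Complete   # confirm the backlog size
$ oc start-build sample                              # prints the name of the new build
$ oc get builds -w                                   # watch how long the new build stays in New
# Before the fix the new build can sit in New for 15+ minutes; with the fix it
# should start almost immediately.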