Description of problem:
The OSBS team has upgraded rcm-img-docker02.build.eng.bos.redhat.com to OCP 3.3.1, and now it takes about 15 minutes for a build to move from New status to Pending and beyond.

Version-Release number of selected component (if applicable):
oc v3.1.1.6-33-g81eabcc
kubernetes v1.1.0-origin-1107-g4c8e6f4
A server with both master and node services running

How reproducible:
Always

Steps to Reproduce:
1. Start a build via brew:
brew container-build extras-rhel-7.2-docker-candidate --repo-url="http://file.brq.redhat.com/~vrutkovs/buildroot.repo" git://pkgs.devel.redhat.com/rpms/osbs-buildroot-docker#f24f4e47dcb71fcbb9cd6bdab6b756265d8cf4fb --git-branch=private-vrutkovs-all-from-git --scratch --channel-override="testing"
2. Wait until a new build - 'scratch-<timestamp>' in the 'default' namespace - appears on that machine

Actual results:
The build is stuck in the New state and doesn't move to Pending and beyond.

Expected results:
The build moves to Pending and proceeds.

Additional info:
A node restart seems to fix it - after ~15 minutes the builds in the queue are scheduled correctly. The symptoms are similar to bug #1374407; however, the build gets stuck in the New state, not Pending.
OK, I will look into it today.
I tried it with 3.3.1 but could not reproduce. Could you provide master and node logs (with log level at least 5) from when the issue happened, along with any more information about your environment?
Vadim, I can select weekly or daily logs in Kibana via the links you sent above, but it only shows the first 500 events. Do you know how to save complete daily or weekly logs locally?
We noticed delays longer than 15 minutes, in some cases more than 6 hours. It looks like the culprit is that this cluster had too many builds: 2937. Once we deleted them, the issue went away. This is an all-in-one OSE cluster (v3.3.1.5).
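For reference, a rough way to count and clean up accumulated builds (assuming cluster-admin access; 'default' is just an example namespace):

$ oc get builds --all-namespaces --no-headers | wc -l    # count builds cluster-wide
$ oc delete builds --all -n default                      # delete all builds in one namespace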
Luiz, could you attach logs from the time when the problem occurred, or give me access to the cluster to look at? I will look into this first thing on Monday.
Created attachment 1239118 [details]
Logs from 10.3.8.111

Extracted a part of the logs from the 10.3.8.111 machine for yesterday: osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40 but was not scheduled until 15:40.
(In reply to Vadim Rutkovsky from comment #19)
> Created attachment 1239118 [details]
> Logs from 10.3.8.111
>
> Extracted a part of the logs from the 10.3.8.111 machine for yesterday:
> osbs-buildroot-docker-private-vrutkovs-all-from-git-9 was created at 12:40
> but was not scheduled until 15:40.

I think it's because I had restarted the master around 15:40 to increase the log level to 5 (see my previous comments), and then the build was scheduled. These logs are therefore from before the restart and, at log level 2, don't provide much help. I am waiting for the problem to happen again so that I can investigate with the increased log level.
Created attachment 1240503 [details]
CPU Profiling

Added CPU profiling data for the cluster when no builds are running (but top still shows 140%+ CPU usage).
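For reference, this kind of profile can be collected from the standard Go pprof endpoint, assuming the master exposes it on localhost:6060 (the goroutine dump in the next attachment comes from the same endpoint); the openshift binary path below is an assumption:

$ curl -o cpu.prof 'http://localhost:6060/debug/pprof/profile?seconds=30'   # sample CPU for 30 seconds
$ go tool pprof -top /usr/bin/openshift cpu.prof                            # list the hottest functions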
Created attachment 1241431 [details]
Stacktrace

Generated a stacktrace on the cluster using 'curl -o stack.txt http://localhost:6060/debug/pprof/goroutine?debug=2'

It seems goroutine 4143 is stuck while applying a policy to a build:

...
github.com/openshift/origin/pkg/client.(*builds).List(0xc8415abda0, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, 0x0, 0x0, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/client/builds.go:48 +0x19f
github.com/openshift/origin/pkg/build/client.OSClientBuildClient.List(0x7fa412ddb838, 0xc822cc1fe8, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/client/clients.go:68 +0xab
github.com/openshift/origin/pkg/build/client.(*OSClientBuildClient).List(0xc82a519f30, 0xc8226e1d10, 0x9, 0x0, 0x0, 0x0, 0x0, 0x7fa412da1800, 0xc8415abd80, 0x0, ...)
	<autogenerated>:4 +0xe7
github.com/openshift/origin/pkg/build/util.BuildConfigBuilds(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0xc82c1b92f0, 0xc849e786d8, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/util/util.go:124 +0xd6
github.com/openshift/origin/pkg/build/controller/policy.GetNextConfigBuild(0x7fa409ede2d0, 0xc82a519f30, 0xc8226e1d10, 0x9, 0xc828c0cb80, 0x19, 0x0, 0x0, 0x0, 0xc8582b1500, ...)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:96 +0xea
github.com/openshift/origin/pkg/build/controller/policy.handleComplete(0x7fa409ede2d0, 0xc82a519f30, 0x7fa409ede2a8, 0xc82a519f20, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/policy.go:135 +0xf2
github.com/openshift/origin/pkg/build/controller/policy.(*SerialPolicy).OnComplete(0xc8292ba260, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/policy/serial.go:34 +0x5a
github.com/openshift/origin/pkg/build/controller.(*BuildController).HandleBuild(0xc8292b8800, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/controller.go:99 +0x694
github.com/openshift/origin/pkg/build/controller/factory.(*BuildControllerFactory).Create.func1(0x4a01ea0, 0xc8315f33e0, 0x0, 0x0)
	/builddir/build/BUILD/atomic-openshift-git-0.0988966/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/factory/factory.go:106 +0x8b
...
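Since the dump is plain text, a couple of greps over stack.txt are enough to spot where goroutines are piling up; the patterns below are just illustrative:

$ grep -c '^goroutine ' stack.txt                       # total goroutines in the dump
$ grep -B1 'pkg/build/controller/policy' stack.txt      # frames inside the build policy code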
I think I found a workaround for this:

$ cat go_template_file
{{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
$ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
$ sudo systemctl restart atomic-openshift-master

The issue seems to be caused by builds created in OSE 3.2. Those didn't have the 'build.start-policy' label set, so atomic-openshift-master was logging a lot of warnings and trying to set a policy for them. This wasted a lot of time, and the changes were not written back to the build object. Since the build scheduler is limited to 30 minutes, with a sufficient number of builds the process eventually gets stuck.
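As a sanity check after the patch step (reusing the same go_template_file), this should print 0 once no build is missing the label:

$ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep -c "<no value>"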
Hi Vadim, thanks. I am going through the information you provided and will get back to you.
(In reply to Vadim Rutkovsky from comment #26)
> I think I found a workaround for this:
>
> $ cat go_template_file
> {{.metadata.name}} - {{index .metadata.labels "openshift.io/build.start-policy"}}
> $ oc get builds -a --no-headers -o name | xargs -n1 oc get -o go-template-file=go_template_file | grep "<no value>" > builds
> $ cat builds | cut -d ' ' -f 1 | xargs -n1 oc patch build -p '{"metadata": {"labels": {"'openshift.io/build.start-policy'": "Serial"}}}'
> $ sudo systemctl restart atomic-openshift-master
>
> The issue seems to be caused by builds created in OSE 3.2. Those didn't
> have the 'build.start-policy' label set, so atomic-openshift-master was
> logging a lot of warnings and trying to set a policy for them.
>
> This wasted a lot of time, and the changes were not written back to the
> build object. Since the build scheduler is limited to 30 minutes, with a
> sufficient number of builds the process eventually gets stuck.

So the issue is no longer reproducible after this workaround?
(In reply to Avesh Agarwal from comment #28)
> (In reply to Vadim Rutkovsky from comment #26)
> > I think I found a workaround for this:
> > [...]
>
> So the issue is no longer reproducible after this workaround?

Yes, this worked for our dev cluster: after the master restart the builds were scheduled again, and they haven't gotten stuck for several days now.
Could anyone provide more info on how QE can verify this bug?
(In reply to Wenjing Zheng from comment #42)
> Could anyone provide more info on how QE can verify this bug?

I think the only way to reproduce this is to create a lot of builds (~3000) and check their state; builds shouldn't get stuck in New. See the sketch below for one way to generate them.
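A rough sketch of generating that many builds (the BuildConfig name 'test-bc' and the 'test-project' namespace are hypothetical placeholders; any small, fast build should do):

$ for i in $(seq 1 3000); do oc start-build test-bc -n test-project; done
$ oc get builds -n test-project | grep -c New            # watch for builds stuck in New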
Verified on 3.4.1.2. Reproduced by creating 3K completed builds on 3.4.0.40 and then starting a new build with oc start-build; it took more than 20 minutes for the build to actually run. Updated to 3.4.1.2 and:
a) started a new build on top of the existing builds; it started immediately.
b) deleted the project and retested with a project with 5K completed builds; new builds started immediately.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0218
*** Bug 1418906 has been marked as a duplicate of this bug. ***