https://github.com/openshift/origin/blob/7876dd52e1c5a44293fec095847750cd2e76db82/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L271-L288 This error apparently doesn't get logger, if there are networking problem, it's impossible to see this error. https://github.com/openshift/openshift-apiserver/blob/31c27b6e999dcc63d8063b6dfe101349f8e54028/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L207 Perhaps the error should be logger as well.
switching to build team even then the actual change will be under openshift-apiserver
@gabe Hi, I am not sure what kind of error should I check, when I do binary build, have error info, seems it's normal error. Thanks [root@wangwen ~]# oc start-build sti --from-dir=vertx-http-example/ Uploading directory "vertx-http-example" as binary input for the build ... ............................................................ Uploading finished Error from server (BadRequest): unable to wait for build sti-2 to run: timed out waiting for the condition
Sorry Wen I meant to update this bug with some verification instructions but then got sidetracked. All we did was augment error logging in the openshift apiserver when processing binary builds. And it was for the most part for stress induced conditions that might be hard to fabricate. Let's do this: 1) clarify what you did exactly to sabotage your binary build so that it would faile 2) then reproduce and send me the pod logs for all the api server pods...I'll see if it falls into our augmented logging and verify the logs occurred. 3) otherwise, we'll see if we can tweak your repro to hit one of the scenarios 4) else, we'll claim due diligence and just mark this as verified.
Hi, Gabe I did two things, I am curious the build in project wewang1 failed, but build complete in project wewang, I didnot do extral network things, thought build should complete in project wewang1 -----Create binary build-------- Steps: 1. Create a project: wewang 2. Create secret to access registry $ oc create secret generic pullsecret --from-file=.dockerconfigjson=/root/.docker/config.json --type=kubernetes.io/dockerconfigjson 3. link secret to builder sa $oc secret link builder pullsecret 4. Create app $oc new-app registry.redhat.io/openjdk/openjdk-11-rhel8:latest~https://github.com/openshift-vertx-examples/vertx-http-example.git --strategy=source 5. Build complete in project wewang $[root@wangwen ~]# oc get build -n wewang NAME TYPE FROM STATUS STARTED DURATION vertx-http-example-1 Source Git@71ef489 Complete About an hour ago 2m22s -----Create sti build----------- Steps: 1. Create a project: wewang1 2. Use bc apiVersion: build.openshift.io/v1 kind: BuildConfig metadata: name: sti spec: output: to: kind: ImageStreamTag name: sti:latest postCommit: {} resources: {} runPolicy: Serial source: binary: {} type: Binary strategy: sourceStrategy: from: kind: DockerImage name: registry.redhat.io/openjdk/openjdk-11-rhel8:latest type: Source 3. Create secret to access registry 4. link secret to builder and default sa 5. clone https://github.com/openshift-vertx-examples/vertx-http-example.git 6. oc start-build sti --from-dir=vertx-http-example/ [root@wangwen ~]# oc start-build sti --from-dir=vertx-http-example/ Uploading directory "vertx-http-example" as binary input for the build ... ............................................................ Uploading finished Error from server (BadRequest): unable to wait for build sti-2 to run: timed out waiting for the condition [root@wangwen ~]# oc get builds NAME TYPE FROM STATUS STARTED DURATION sti-1 Source Binary@71ef489 Cancelled (CancelledBuild) About an hour ago sti-2 Source Binary@71ef489 Cancelled (CancelledBuild) 49 minutes ago 7. Check the pod logs in openshift-apiserver, full logs is attached [root@wangwen ~]# oc logs -f pod/apiserver-qk8zd -n openshift-apiserver|grep -E "generator.go|rest.go" W0203 12:39:50.006235 1 rest.go:239] unable to wait for build sti-1 to run: timed out waiting for the condition W0203 15:22:13.126898 1 rest.go:239] unable to wait for build sti-2 to run: timed out waiting for the condition
Created attachment 1657565 [details] openshift-apiserver pod log
Happy accident then @Wen :-) The problem you ran into in fact triggered some of the new logging we added. Namely these 2: I0204 12:20:09.449638 1 generator.go:724] Error resolving ImageStreamTag openjdk-11-rhel8:latest in namespace wewang: unable to find latest tagged image W0204 13:02:42.700227 1 rest.go:239] unable to wait for build sti-2 to run: timed out waiting for the condition So we can verify this bug, which is centered around the increased apiserver error logging. Now, why did you get the error? based on 'I0204 12:20:09.449638 1 generator.go:724] Error resolving ImageStreamTag openjdk-11-rhel8:latest in namespace wewang: unable to find latest tagged image' that would imply that the build config source strategy from was of kind ImageStreamTag and not DockerImage, which does not line up with the BC you noted in #Comment #5 So, we are going to do this: 1) mark this bug Verified for the apiserver error logging ... I'll do this now 2) give me a repro of this where you - oc get bc sti -o yaml - oc start-build sti --from-dir=vertx-http-example --loglevel=10 - oc get pod <build pod> -o yaml after the build is complete 3) If it still fails, and after I look at the data from 2), something seems amiss, then you open a new bugzilla against Builds, and we'll go from there
Here's the info I got: - oc get bc sti -o yaml http://pastebin.test.redhat.com/833187 - oc start-build sti --from-dir=vertx-http-example --loglevel=10 http://pastebin.test.redhat.com/833185 - oc get pod <build pod> -o yaml did not create any pod in my project
a little context (which may already be understood, but just in case it's not): The build is indeed being cancelled..it's apiserver that cancels it: https://github.com/openshift/openshift-apiserver/blob/master/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L219-L227 this is sort of unique to the binary build flow. The binary instantiate api is effectively the real "build client" here (it creates the build object and waits for the build pod and then just relies status back to the oc client). So when the build pod doesn't start running in time, it does it's best to clean up the potential mess by canceling the build. i'd say that's all reasonable behavior. Now, could we do any better in terms of reflecting why the build was canceled, in the build object? probably not easily today (I don't think there's a way to pass a reason when canceling a build, we just mark it for cancelation and then the OCM updates the build phase to canceled). now on to the questions: 1) i don't think we can safely always do that, which is why i elected to clear it out...i worry that leaving the messages in place (especially when changing the condition to false) will lead to confusing information for users). Note that i was also a bit hamstrung by my implementation approach which precluded leaving multiple conditions set (handling multiple conditions being set would have been a much more complicated implementation). 2) yes, the build was canceled (by the binaryinstantiate api flow), i still believe canceling is the right thing to do, so the build should reflect that it was canceled (both in its phase + its conditions) 3) maybe, but it needs to be done carefully. depending on the error we got back, "latest" might be nil or otherwise unhelpful. But yes in this one specific case we could probably look at "latest" and extract the "message" from the "new" phase/condition, and return it as part of the error that goes back to the oc client. Something like ("Gave up waiting for build, build was still in New due to: %s",message)
thanks for the info Ben Wen - Based on Ben's feedback I think the risk / reward is too negative to try to optimize this edge error reporting case. IMO let's hold off on any additional bugzillas here.
Yes, Thanks Gabe and Ben for the explanation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581