1795705 – Improve builds error logging

Bug 1795705 - Improve builds error logging

Summary: Improve builds error logging

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Build
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.4.0
Assignee:	Gabe Montero
QA Contact:	wewang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2020-01-28 16:47 UTC by Oleg Bulatov
Modified:	2020-05-04 11:27 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Image stream tag generations were no bumped if changing image stream specifications still lead to an import error. Consequence: Users changing their samples operator samplesRegistry setting were lead to believe the samples operator did not take the config change when they looked at the image stream status. Fix: The openshift apiserver was augmented to take the failure specifics in the tag event fields as a reason to bump the generation in the tag's status. Result: If a user's change to the samples operator's samplesRegistry field still leads to an image stream import error, the new failure reasons now appear in the image stream status.
Clone Of:
Environment:
Last Closed:	2020-05-04 11:27:29 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
openshift-apiserver pod log (347.51 KB, text/plain) 2020-02-04 14:09 UTC, wewang	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift openshift-apiserver pull 70	None	closed	Bug 1795705: increase error logging around apiserver binary build flow	2020-05-17 14:38:25 UTC
Github	openshift openshift-apiserver pull 71	None	closed	Bug 1795705: return actual instantiate error for binary builds	2020-05-17 14:38:25 UTC
Red Hat Product Errata	RHBA-2020:0581	None	None	None	2020-05-04 11:27:45 UTC

Description Oleg Bulatov 2020-01-28 16:47:21 UTC

https://github.com/openshift/origin/blob/7876dd52e1c5a44293fec095847750cd2e76db82/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L271-L288

This error apparently doesn't get logger, if there are networking problem, it's impossible to see this error.

https://github.com/openshift/openshift-apiserver/blob/31c27b6e999dcc63d8063b6dfe101349f8e54028/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L207

Perhaps the error should be logger as well.

Comment 1 Gabe Montero 2020-01-28 18:46:51 UTC

switching to build team even then the actual change will be under openshift-apiserver

Comment 3 wewang 2020-02-03 15:29:09 UTC

@gabe Hi, I am not sure what kind of error should I check, when I do binary build, have error info, seems it's normal error. Thanks
[root@wangwen ~]# oc start-build sti --from-dir=vertx-http-example/
Uploading directory "vertx-http-example" as binary input for the build ...
............................................................
Uploading finished
Error from server (BadRequest): unable to wait for build sti-2 to run: timed out waiting for the condition

Comment 4 Gabe Montero 2020-02-03 16:30:16 UTC

Sorry Wen I meant to update this bug with some verification instructions but then got sidetracked.

All we did was augment error logging in the openshift apiserver when processing binary builds.  And it was 
for the most part for stress induced conditions that might be hard to fabricate.

Let's do this:

1) clarify what you did exactly to sabotage your binary build so that it would faile

2) then reproduce and send me the pod logs for all the api server pods...I'll see if 
it falls into our augmented logging and verify the logs occurred.

3) otherwise, we'll see if we can tweak your repro to hit one of the scenarios

4) else, we'll claim due diligence and just mark this as verified.

Comment 5 wewang 2020-02-04 14:08:04 UTC

Hi, Gabe I did two things, I am curious the build in project wewang1 failed, but build complete in project wewang, I didnot do extral network things, thought build should complete in project wewang1
-----Create binary build--------
Steps:
 1. Create a project: wewang 

 2. Create secret to access registry
   $ oc create secret generic pullsecret --from-file=.dockerconfigjson=/root/.docker/config.json  --type=kubernetes.io/dockerconfigjson

 3. link secret to builder sa
   $oc secret link builder pullsecret

 4. Create app
   $oc new-app registry.redhat.io/openjdk/openjdk-11-rhel8:latest~https://github.com/openshift-vertx-examples/vertx-http-example.git --strategy=source
 
 5. Build complete in project wewang
   $[root@wangwen ~]# oc get build -n wewang
NAME                   TYPE     FROM          STATUS     STARTED             DURATION
vertx-http-example-1   Source   Git@71ef489   Complete   About an hour ago   2m22s



-----Create sti build-----------
Steps:
1. Create a project: wewang1

2. Use bc
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: sti
spec:
  output:
    to:
      kind: ImageStreamTag
      name: sti:latest
  postCommit: {}
  resources: {}
  runPolicy: Serial
  source:
    binary: {}
    type: Binary
  strategy:
    sourceStrategy:
      from:
        kind: DockerImage
        name: registry.redhat.io/openjdk/openjdk-11-rhel8:latest
    type: Source

3. Create secret to access registry

4. link secret to builder and default sa

5. clone https://github.com/openshift-vertx-examples/vertx-http-example.git

6. oc start-build sti  --from-dir=vertx-http-example/
[root@wangwen ~]# oc start-build sti  --from-dir=vertx-http-example/
Uploading directory "vertx-http-example" as binary input for the build ...
............................................................
Uploading finished
Error from server (BadRequest): unable to wait for build sti-2 to run: timed out waiting for the condition

[root@wangwen ~]#  oc get builds
NAME    TYPE     FROM             STATUS                       STARTED             DURATION
sti-1   Source   Binary@71ef489   Cancelled (CancelledBuild)   About an hour ago   
sti-2   Source   Binary@71ef489   Cancelled (CancelledBuild)   49 minutes ago      

7. Check the pod logs in openshift-apiserver, full logs is attached
[root@wangwen ~]# oc logs -f pod/apiserver-qk8zd -n openshift-apiserver|grep -E "generator.go|rest.go"
W0203 12:39:50.006235       1 rest.go:239] unable to wait for build sti-1 to run: timed out waiting for the condition
W0203 15:22:13.126898       1 rest.go:239] unable to wait for build sti-2 to run: timed out waiting for the condition

Comment 6 wewang 2020-02-04 14:09:41 UTC

Created attachment 1657565 [details]
openshift-apiserver pod log

Comment 7 Gabe Montero 2020-02-04 20:42:10 UTC

Happy accident then @Wen :-)

The problem you ran into in fact triggered some of the new logging we added.

Namely these 2:
I0204 12:20:09.449638       1 generator.go:724] Error resolving ImageStreamTag openjdk-11-rhel8:latest in namespace wewang: unable to find latest tagged image
W0204 13:02:42.700227       1 rest.go:239] unable to wait for build sti-2 to run: timed out waiting for the condition

So we can verify this bug, which is centered around the increased apiserver error logging.

Now, why did you get the error?


based on 'I0204 12:20:09.449638       1 generator.go:724] Error resolving ImageStreamTag openjdk-11-rhel8:latest in namespace wewang: unable to find latest tagged image'
that would imply that the build config source strategy from was of kind ImageStreamTag and not DockerImage, which does not 
line up with the BC you noted in #Comment #5

So, we are going to do this:
1) mark this bug Verified for the apiserver error logging ... I'll do this now
2) give me a repro of this where you 
   - oc get bc sti -o yaml
   - oc start-build sti --from-dir=vertx-http-example --loglevel=10
   - oc get pod <build pod> -o yaml after the build is complete

3) If it still fails, and after I look at the data from 2), something seems amiss, then you open a new bugzilla against Builds, and we'll go from there

Comment 8 wewang 2020-02-05 09:59:48 UTC

Here's the info I got:

- oc get bc sti -o yaml
  http://pastebin.test.redhat.com/833187

- oc start-build sti --from-dir=vertx-http-example --loglevel=10
  http://pastebin.test.redhat.com/833185

- oc get pod <build pod> -o yaml 
  did not create any pod in my project

Comment 10 Ben Parees 2020-02-05 22:28:38 UTC

a little context (which may already be understood, but just in case it's not):

The build is indeed being cancelled..it's apiserver that cancels it:
https://github.com/openshift/openshift-apiserver/blob/master/pkg/build/apiserver/registry/buildconfiginstantiate/rest.go#L219-L227

this is sort of unique to the binary build flow. The binary instantiate api is effectively the real "build client" here (it creates the build object and waits for the build pod and then just relies status back to the oc client). So when the build pod doesn't start running in time, it does it's best to clean up the potential mess by canceling the build.

i'd say that's all reasonable behavior.

Now, could we do any better in terms of reflecting why the build was canceled, in the build object? probably not easily today (I don't think there's a way to pass a reason when canceling a build, we just mark it for cancelation and then the OCM updates the build phase to canceled).

now on to the questions:

1) i don't think we can safely always do that, which is why i elected to clear it out...i worry that leaving the messages in place (especially when changing the condition to false) will lead to confusing information for users). Note that i was also a bit hamstrung by my implementation approach which precluded leaving multiple conditions set (handling multiple conditions being set would have been a much more complicated implementation).

2) yes, the build was canceled (by the binaryinstantiate api flow), i still believe canceling is the right thing to do, so the build should reflect that it was canceled (both in its phase + its conditions)

3) maybe, but it needs to be done carefully. depending on the error we got back, "latest" might be nil or otherwise unhelpful. But yes in this one specific case we could probably look at "latest" and extract the "message" from the "new" phase/condition, and return it as part of the error that goes back to the oc client. Something like ("Gave up waiting for build, build was still in New due to: %s",message)

Comment 11 Gabe Montero 2020-02-05 23:11:02 UTC

thanks for the info Ben

Wen - Based on Ben's feedback I think the risk / reward is too negative to try to optimize this edge error reporting case.

IMO let's hold off on any additional bugzillas here.

Comment 12 wewang 2020-02-06 13:30:51 UTC

Yes, Thanks Gabe and Ben for the explanation.

Comment 14 errata-xmlrpc 2020-05-04 11:27:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Note You need to log in before you can comment on or make changes to this bug.