spec:
  nodeSelector: null
  output: {}
  postCommit: {}
  resources: {}
  source:
    contextDir: images/tools
    git:
      ref: tools_image
      uri: https://github.com/smarterclayton/oc.git
    type: Git
  strategy:
    dockerStrategy:
      dockerfilePath: Dockerfile
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: "2020-03-21T23:49:42Z"
  duration: 19000000000
  message: The pod for this build already exists and is older than the build.

Steps to reproduce:
1. Create a build.
2. The build pod is created.
3. Delete the build and immediately recreate it.

Expected: the next build waits for the old pod (owned by the previous build's UID) to be cleaned up by GC, and then continues.
Actual: the build fails with BuildPodExists.

This is a failure mode that is likely to cause issues with scripting.
Interestingly, in both places where that build error is reported, the controller is already doing its own owner-ref checks:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L700-L702
and
https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1235

The owner ref verification logic is at https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/strategy/util.go#L305-L332. I'll paste the code here to facilitate discussion:

  func makeOwnerReference(build *buildv1.Build) metav1.OwnerReference {
      t := true
      return metav1.OwnerReference{
          APIVersion: BuildControllerRefKind.GroupVersion().String(),
          Kind:       BuildControllerRefKind.Kind,
          Name:       build.Name,
          UID:        build.UID,
          Controller: &t,
      }
  }

  func setOwnerReference(pod *corev1.Pod, build *buildv1.Build) {
      pod.OwnerReferences = []metav1.OwnerReference{makeOwnerReference(build)}
  }

  // HasOwnerReference returns true if the build pod has an OwnerReference to the
  // build.
  func HasOwnerReference(pod *corev1.Pod, build *buildv1.Build) bool {
      ref := makeOwnerReference(build)
      for _, r := range pod.OwnerReferences {
          if reflect.DeepEqual(r, ref) {
              return true
          }
      }
      return false
  }

And sure enough, unless the UIDs match, HasOwnerReference returns false, which means we flag the error. The two calls I noted are the only uses of HasOwnerReference in the build controller.

So augmenting that call to return two booleans:
1) whether any owner ref exists at all
2) whether a matching owner ref was found

should start the path toward the prescribed behavior. Presumably waiting would be achieved by returning an error up the chain that tells the controller to requeue the event. However, tracking how long we've waited and when to give up waiting, I think, steers us away from that approach. The next idea that comes to my mind is to add something like an annotation to the build object with a timestamp recording when we started waiting.
We then update the build with that annotation and check it when we receive the next update, and either accept the build/pod combo or eventually error out based on how long we decide to wait. Anyway, consider this my initial contribution toward triage. I'll add this one to my bug burndown queue, which I should start tackling again relatively soon. Take it back, Adam, if you already have something cooking in this space. Of course, comments on the approach I've articulated prior to any PR generation are welcome.
@Gabe, I used build.yaml [1] to check the bug, but still hit BuildPodExists. Perhaps I misunderstood something about ownerReferences? Could you help check? Thanks.

Version: 4.5.0-0.nightly-2020-04-23-202137

Steps:

  #!/bin/bash
  oc create -f ./build.yaml
  sleep 10
  oc get pod
  oc delete build bz1815816-1
  oc create -f ./build.yaml
  oc get builds -w

Result:

  $ oc get build -oyaml
  status:
    completionTimestamp: "2020-04-24T10:01:02Z"
    conditions:
    - lastTransitionTime: "2020-04-24T10:01:02Z"
      lastUpdateTime: "2020-04-24T10:01:02Z"
      status: "False"
      type: New
    - lastTransitionTime: "2020-04-24T10:01:02Z"
      lastUpdateTime: "2020-04-24T10:01:02Z"
      message: The pod for this build already exists and is older than the build.
      reason: BuildPodExists
      status: "True"
      type: Error
    duration: 29000000000
    message: The pod for this build already exists and is older than the build.
    output: {}
    phase: Error
    reason: BuildPodExists
    startTimestamp: "2020-04-24T10:00:33Z"

[1] http://pastebin.test.redhat.com/858784
If I understand what you did, Wen, and your build yaml had

  ownerReferences:
  - apiVersion: v1
    kind: Build
    name: bz1815816-1
    uid: test

from the very beginning, then your test case is incorrect. Try it where your build yaml file at the start is only:

  apiVersion: v1
  kind: Build
  metadata:
    name: bz1815816-1
  spec:
    source:
      git:
        uri: https://github.com/openshift/ruby-hello-world
      type: Git
    strategy:
      sourceStrategy:
        from:
          kind: ImageStreamTag
          name: ruby:latest
          namespace: openshift
      type: Source

Thanks.
@Gabe, my understanding was that after deleting the build, if the pod whose ownerReference points at the build is removed by GC within 2 minutes, the new build would not fail. But that is not what I see when I test. Could you correct me? Thanks.

Version: 4.5.0-0.nightly-2020-04-25-170442

  [wewang@wangwen ~]$ cat build.yaml
  apiVersion: v1
  kind: Build
  metadata:
    name: bz1815816-1
  spec:
    source:
      git:
        uri: https://github.com/openshift/ruby-hello-world
      type: Git
    strategy:
      sourceStrategy:
        from:
          kind: ImageStreamTag
          name: ruby:latest
          namespace: openshift
      type: Source
  [wewang@wangwen ~]$ cat bz1815816.sh
  #!/bin/bash
  oc create -f ./build.yaml
  oc get pod
  oc get pod -o=jsonpath='{.items[0].metadata.ownerReferences[0]}'
  oc delete build bz1815816-1
  oc create -f ./build.yaml
  oc get builds -w
  [wewang@wangwen ~]$ sh bz1815816.sh
  build.build.openshift.io/bz1815816-1 created
  NAME                READY   STATUS     RESTARTS   AGE
  bz1815816-1-build   0/1     Init:0/2   0          2s
  map[apiVersion:build.openshift.io/v1 controller:true kind:Build name:bz1815816-1 uid:68036a75-b241-48a3-972d-f37342233acb]
  build.build.openshift.io "bz1815816-1" deleted
  build.build.openshift.io/bz1815816-1 created
  NAME          TYPE     FROM   STATUS                   STARTED          DURATION
  bz1815816-1   Source   Git    Error (BuildPodExists)   10 seconds ago   9s
  [wewang@wangwen ~]$ oc get build -oyaml
  status:
    completionTimestamp: "2020-04-26T08:10:45Z"
    conditions:
    - lastTransitionTime: "2020-04-26T08:10:45Z"
      lastUpdateTime: "2020-04-26T08:10:45Z"
      status: "False"
      type: New
    - lastTransitionTime: "2020-04-26T08:10:45Z"
      lastUpdateTime: "2020-04-26T08:10:45Z"
      message: The pod for this build already exists and is older than the build.
      reason: BuildPodExists
      status: "True"
      type: Error
    duration: 9000000000
    logSnippet: The container could not be located when the pod was terminated
    message: The pod for this build already exists and is older than the build.
    output: {}
    phase: Error
    reason: BuildPodExists
    startTimestamp: "2020-04-26T08:10:36Z"
Hmmm ... the GC delays may be longer than we think, or the GC is not happening, or there is an issue with the fix. I'll see if I can try this today, but if I don't get to it before you do, @Wen, please retry and gather the following data:

1) oc get build bz1815816-1 -o yaml for the original build
2) oc get pod bz1815816-1-build -o yaml for the original build pod
3) delete like before
4) the full yaml for the build and whatever build pods remain
OK, got a chance to try it. There do appear to be some timing windows: I saw the problem once, as Wen noted in comment 6. But I also did some runs, for example with trace on, where the test script worked fine. I'll take the bug back and iterate some more on repros and some sample patches for tracing, and see what I can uncover.
OK Wen, I'm seeing consistent, successful results now after a string of successive test runs with https://github.com/openshift/openshift-controller-manager/pull/96. Let's have you try it again once that merges.
FYI, the cluster version is 4.5.0-0.nightly-2020-04-29-025631.
@Wen So I logged onto your cluster, and it seems that the 04-29-025631 level does not have the changes. In fact, I don't think they are even in the very latest nightly, 4.5.0-0.nightly-2020-04-29-111042, when I compare back for several days; i.e. in https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-04-29-111042?from=4.5.0-0.nightly-2020-04-27-151945 openshift-controller-manager is not in the list.

Here is what I found:

1) I looked at the OCM logs as is, before any repros, and saw these lines:

  I0429 09:09:25.129628 1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted
  I0429 09:09:25.835532 1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted
  I0429 09:13:56.197205 1 build_controller.go:1612] Build "wewang1/bz1815816-1" has been deleted
  I0429 09:22:58.818349 1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted

However, in the latest version, which includes my fix, that log statement is at line 1648: https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1648

2) I then turned on trace for the OCM:
  a) oc edit openshiftcontrollermanager
  b) set logLevel: Trace
  c) oc get pods -w -n openshift-controller-manager   # wait for the OCM to restart
  d) oc get pods -o name -n openshift-controller-manager | xargs -l -r oc logs -n openshift-controller-manager   # make sure leader election has occurred

3) I observed more log statements with differing line numbers.

4) I ran my test script, repro'ed the problem, and again saw differing line numbers.

I'll try to ask the ART folks today what is up ... it appears to me, at least, that this bug was moved from MODIFIED to ON_QA incorrectly.
Yeah, it appears the OSE builds for OCM are failing ... see https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=72245. Nothing clean since Apr 27. ART is diving into it for us.
Actually, the lack of 4.5 OSBS / Brew builds for OCM may date back all the way to Apr 14 and changes related to bumping to golang 1.13. Still actively engaged with ART and Justin Pierce on Slack. A last-resort change might be https://github.com/openshift/openshift-controller-manager/pull/98 if the current theories and investigations fall that way, but ideally those RPMs get included in the golang 1.13 image Brew uses.
Good catch @Gabe. Tested in the version below, which includes the bug PRs (PR 98 is not related to this bug), so verifying it.

4.5.0-0.nightly-2020-04-30-042256

Steps:

  [wewang@wangwen ~]$ sh bz1815816.sh
  build.build.openshift.io/bz1815816-1 created
  NAME                        READY   STATUS      RESTARTS   AGE
  bz1815816-1-build           0/1     Init:0/2    0          1s
  ruby-hello-world-1-build    0/1     Completed   0          147m
  ruby-hello-world-1-deploy   0/1     Completed   0          146m
  ruby-hello-world-1-hcf75    1/1     Running     0          146m
  map[apiVersion:build.openshift.io/v1 controller:true kind:Build name:bz1815816-1 uid:0d3db81a-35ce-4984-8928-8a483dcda7a1]
  build.build.openshift.io "bz1815816-1" deleted
  build.build.openshift.io/bz1815816-1 created
  NAME                 TYPE     FROM          STATUS     STARTED              DURATION
  ruby-hello-world-1   Source   Git@57073c0   Complete   2 hours ago          1m30s
  bz1815816-1          Source   Git           New
  bz1815816-1          Source   Git           Pending
  bz1815816-1          Source   Git           Running    3 seconds ago
  bz1815816-1          Source   Git@57073c0   Running    4 seconds ago
  bz1815816-1          Source   Git@57073c0   Running    About a minute ago
  bz1815816-1          Source   Git@57073c0   Complete   About a minute ago   1m9s
Awesome, thanks @Wen. And to corroborate, if you look at the report https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-04-30-084834?from=4.5.0-0.nightly-2020-04-27-192810 you'll now see a bunch of our recent OCM changes:

openshift-controller-manager
- Bug 1825324: Use named queues for all controllers #95
- Bug 1815816: more timing windows on build/pod owner ref correlation (allow GC to c… #96
- Bug 1815816: add 2 min retry on create build existing pod wrong owner (allow for GC) #94
- Bug 1801415: Migrate extensions/v1beta1 -> networking.k8s.io/v1beta1 #83
- Bug 1813233: DEVEXP-533 - ocm k8s 118; prometheus bump; *-go bump #85

The changes Justin and the ART folks made did the trick.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409