Bug 1815816 - Build fails with error when old pod is being deleted by garbage collection
Summary: Build fails with error when old pod is being deleted by garbage collection
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Gabe Montero
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-21 23:52 UTC by Clayton Coleman
Modified: 2020-07-13 17:23 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The build controller lacked tolerance for usage scenarios in which a build was started, deleted, and then recreated before garbage collection of the underlying pod had occurred.
Consequence: The final build attempt in such scenarios failed when it could have succeeded with a little patience.
Fix: The build controller was augmented to wait up to 2 minutes for a build pod's owner reference to match the latest version of the build.
Result: Scenarios common to automated scripting, such as build create, delete, recreate, now work with OpenShift builds.
Clone Of:
Environment:
Last Closed: 2020-07-13 17:23:29 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift openshift-controller-manager pull 94 0 None closed Bug 1815816: add 2 min retry on create build existing pod wrong owner (allow for GC) 2020-06-23 09:32:07 UTC
Github openshift openshift-controller-manager pull 96 0 None closed Bug 1815816: more timing windows on build/pod owner ref correlation (allow GC to c… 2020-06-23 09:32:07 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:23:51 UTC

Description Clayton Coleman 2020-03-21 23:52:26 UTC
spec:
  nodeSelector: null
  output: {}
  postCommit: {}
  resources: {}
  source:
    contextDir: images/tools
    git:
      ref: tools_image
      uri: https://github.com/smarterclayton/oc.git
    type: Git
  strategy:
    dockerStrategy:
      dockerfilePath: Dockerfile
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: "2020-03-21T23:49:42Z"
  duration: 19000000000
  message: The pod for this build already exists and is older than the build.


1. Create build
2. Pod is created
3. Delete build and immediately recreate it

Expected: The next build waits for the old pod (owned by the previous build's UID) to be cleaned up by GC and then continues

Actual: Build fails

This is a failure mode that is likely to cause issues with scripting.

Comment 1 Gabe Montero 2020-04-14 15:52:02 UTC
Interestingly, in both places where that build error is noted, the controller is already doing owner ref checks:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L700-L702

and 

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1235

the owner ref verification logic is at https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/strategy/util.go#L305-L332

I'll paste the code here to facilitate discussion 

func makeOwnerReference(build *buildv1.Build) metav1.OwnerReference {
	t := true
	return metav1.OwnerReference{
		APIVersion: BuildControllerRefKind.GroupVersion().String(),
		Kind:       BuildControllerRefKind.Kind,
		Name:       build.Name,
		UID:        build.UID,
		Controller: &t,
	}
}

func setOwnerReference(pod *corev1.Pod, build *buildv1.Build) {
	pod.OwnerReferences = []metav1.OwnerReference{makeOwnerReference(build)}
}

// HasOwnerReference returns true if the build pod has an OwnerReference to the
// build.
func HasOwnerReference(pod *corev1.Pod, build *buildv1.Build) bool {
	ref := makeOwnerReference(build)

	for _, r := range pod.OwnerReferences {
		if reflect.DeepEqual(r, ref) {
			return true
		}
	}

	return false
}

And sure enough, since reflect.DeepEqual compares the entire OwnerReference (including the UID), it will return false unless the UIDs match, which means we flag the error.

The 2 calls I noted are the only uses of HasOwnerReference in the build controller.

So augmenting that call to return 2 booleans:
1) one indicating whether any owner ref exists at all
2) one indicating whether a matching owner ref was found

should start the path for augmenting the behavior as prescribed (see the sketch below).
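A minimal sketch, assuming the same imports and the makeOwnerReference helper from the excerpt above, of what that split could look like; the function name and return values are hypothetical, not anything that has been merged:

// hasMatchingOwnerReference is a sketch of a HasOwnerReference variant that
// distinguishes "the pod has no owner references at all" from "the pod has
// owner references, but none of them match this build's UID".
func hasMatchingOwnerReference(pod *corev1.Pod, build *buildv1.Build) (hasAnyRef bool, matchesBuild bool) {
	ref := makeOwnerReference(build)
	for _, r := range pod.OwnerReferences {
		hasAnyRef = true
		if reflect.DeepEqual(r, ref) {
			return true, true
		}
	}
	return hasAnyRef, false
}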

Presumably waiting would be achieved by returning an error up the chain that tells the controller to requeue the event.

However, the need to track how long we have waited, and when to give up waiting, steers us away from that approach I think.

The next idea that comes to mind is to add something like an annotation to the build object with a timestamp recording when we started waiting.
We then update the build with that annotation and check it when we receive the next update, and either accept the build/pod combo or
eventually error out based on how long we decide to wait (a rough sketch follows).
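Very roughly, and purely to illustrate the idea, such a check might look like the sketch below; the annotation key, helper name, and the 2 minute timeout are all made up here (it assumes the standard library time package plus the buildv1 import from the excerpt above), and none of this is a merged implementation:

// ownerRefWaitAnnotation is a hypothetical annotation recording when we first
// saw a stale pod owned by an older incarnation of the build.
const ownerRefWaitAnnotation = "build.openshift.io/owner-ref-wait-start"

// ownerRefWaitTimeout is how long we are willing to wait for GC to remove the
// old pod before surfacing BuildPodExists.
const ownerRefWaitTimeout = 2 * time.Minute

// shouldKeepWaiting stamps the build with a wait-start timestamp the first
// time we hit the stale pod, and on later updates reports whether we should
// keep requeueing (true) or give up and flag the error (false).
func shouldKeepWaiting(build *buildv1.Build, now time.Time) (bool, error) {
	if build.Annotations == nil {
		build.Annotations = map[string]string{}
	}
	start, ok := build.Annotations[ownerRefWaitAnnotation]
	if !ok {
		// first encounter: record when we started waiting and requeue
		build.Annotations[ownerRefWaitAnnotation] = now.Format(time.RFC3339)
		return true, nil
	}
	started, err := time.Parse(time.RFC3339, start)
	if err != nil {
		return false, err
	}
	return now.Sub(started) < ownerRefWaitTimeout, nil
}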

Anyway, consider this my initial contribution towards triage.

I'll add this one to my bug burndown queue, which I should start tackling again relatively soon.  Take it back, Adam, if you already have
something cooking in this space.

Of course comments on the approach I've articulated prior to any PR generation are welcome.

Comment 4 wewang 2020-04-24 10:23:54 UTC
@Gabe, I used build.yaml [1] to check the bug, but still hit BuildPodExists. Did I misunderstand something about ownerReferences? Could you help check? Thanks.
Version:
4.5.0-0.nightly-2020-04-23-202137

Steps:
'''
#!/bin/bash
oc create -f ./build.yaml
sleep 10
oc get pod
oc delete build bz1815816-1
oc create -f ./build.yaml
oc get builds -w
'''
result: 
$ oc get build -oyaml

  status:
    completionTimestamp: "2020-04-24T10:01:02Z"
    conditions:
    - lastTransitionTime: "2020-04-24T10:01:02Z"
      lastUpdateTime: "2020-04-24T10:01:02Z"
      status: "False"
      type: New
    - lastTransitionTime: "2020-04-24T10:01:02Z"
      lastUpdateTime: "2020-04-24T10:01:02Z"
      message: The pod for this build already exists and is older than the build.
      reason: BuildPodExists
      status: "True"
      type: Error
    duration: 29000000000
    message: The pod for this build already exists and is older than the build.
    output: {}
    phase: Error
    reason: BuildPodExists
    startTimestamp: "2020-04-24T10:00:33Z"

[1]http://pastebin.test.redhat.com/858784

Comment 5 Gabe Montero 2020-04-24 12:09:54 UTC
If I understand what you did, Wen, and your build yaml had

  ownerReferences:
  - apiVersion: v1
    kind: Build
    name: bz1815816-1
    uid: test

from the very beginning

then your test case is incorrect.

Try it where your build yaml file at the start is only:

apiVersion: v1
kind: Build
metadata:
  name: bz1815816-1
spec:
  source:
    git:
      uri: https://github.com/openshift/ruby-hello-world
    type: Git
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: ruby:latest
        namespace: openshift
    type: Source


thanks

Comment 6 wewang 2020-04-26 08:20:09 UTC
@Gabe, my understanding was that after deleting the build, if the pod whose ownerReferences point at the build is removed by GC within 2 minutes, the new build would not fail. But when I test it, that is not what happens. Could you correct me if I misunderstood? Thanks.

Version: 4.5.0-0.nightly-2020-04-25-170442


[wewang@wangwen ~]$ cat build.yaml
apiVersion: v1
kind: Build
metadata:
  name: bz1815816-1
spec:
  source:
    git:
      uri: https://github.com/openshift/ruby-hello-world
    type: Git
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: ruby:latest
        namespace: openshift
    type: Source

[wewang@wangwen ~]$ cat bz1815816.sh 
#!/bin/bash
oc create -f ./build.yaml
oc get pod
oc get pod -o=jsonpath='{.items[0].metadata.ownerReferences[0]}'
oc delete build bz1815816-1
oc create -f ./build.yaml
oc get builds -w

[wewang@wangwen ~]$ sh  bz1815816.sh 
build.build.openshift.io/bz1815816-1 created

NAME                READY   STATUS     RESTARTS   AGE
bz1815816-1-build   0/1     Init:0/2   0          2s

map[apiVersion:build.openshift.io/v1 controller:true kind:Build name:bz1815816-1 uid:68036a75-b241-48a3-972d-f37342233acb]

build.build.openshift.io "bz1815816-1" deleted

build.build.openshift.io/bz1815816-1 created
NAME          TYPE     FROM   STATUS                   STARTED          DURATION
bz1815816-1   Source   Git    Error (BuildPodExists)   10 seconds ago   9s

[wewang@wangwen ~]$ oc get build -oyaml

  status:
    completionTimestamp: "2020-04-26T08:10:45Z"
    conditions:
    - lastTransitionTime: "2020-04-26T08:10:45Z"
      lastUpdateTime: "2020-04-26T08:10:45Z"
      status: "False"
      type: New
    - lastTransitionTime: "2020-04-26T08:10:45Z"
      lastUpdateTime: "2020-04-26T08:10:45Z"
      message: The pod for this build already exists and is older than the build.
      reason: BuildPodExists
      status: "True"
      type: Error
    duration: 9000000000
    logSnippet: The container could not be located when the pod was terminated
    message: The pod for this build already exists and is older than the build.
    output: {}
    phase: Error
    reason: BuildPodExists
    startTimestamp: "2020-04-26T08:10:36Z"

Comment 7 Gabe Montero 2020-04-27 14:01:59 UTC
Hmmm ... the GC delays may be longer than we think, or the GC is not happening, or there is an issue with the fix.

I'll see if I can try this today, but if I don't get to it before you do @Wen, please retry and gather the following data:

1) oc get build bz1815816-1 -o yaml 
for the original build 
2) oc get pod bz1815816-1-build -o yaml
for the original build pod
3) delete like before

4) the full yaml for the recreated build and whatever build pods remain

Comment 8 Gabe Montero 2020-04-27 21:07:29 UTC
ok got a chance to try it 

there do appear to be some timing windows

I saw the problem once, as Wen noted in Comment 6.

But I also did some runs, with trace on for example, where the test script worked fine.

I'll take the bug back, iterate some more on repros and some patches for tracing,
and see what I can uncover.

Comment 9 Gabe Montero 2020-04-27 21:51:52 UTC
OK Wen, I'm seeing consistently successful results now after a string of test runs with https://github.com/openshift/openshift-controller-manager/pull/96

let's have you try it again once that merges

Comment 12 wewang 2020-04-29 09:31:39 UTC
Fyi, cluster version is:
4.5.0-0.nightly-2020-04-29-025631

Comment 13 Gabe Montero 2020-04-29 14:19:06 UTC
@Wen

So I logged onto your cluster, and it seems that the 04-29-025631 level does not have the changes.
In fact, I don't think the fix is even in the very latest nightly when I compare 4.5.0-0.nightly-2020-04-29-111042 back for several days.
i.e. https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-04-29-111042?from=4.5.0-0.nightly-2020-04-27-151945
OPENSHIFT-CONTROLLER-MANAGER is not in the list.

Then, what I found:

1) I looked at the OCM logs as is, before any repro's, and saw these lines:

I0429 09:09:25.129628       1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted
I0429 09:09:25.835532       1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted
I0429 09:13:56.197205       1 build_controller.go:1612] Build "wewang1/bz1815816-1" has been deleted
I0429 09:22:58.818349       1 build_controller.go:1612] Build "wewang/bz1815816-1" has been deleted


However, in the latest version that includes my fix, that log statement is at line 1648:

https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L1648

2) I then turned on trace for the OCM
  a) oc edit openshiftcontrollermanager
  b) logLevel: Trace
  c) oc get pods -w -n openshift-controller-manager # wait for OCM to restart
  d) oc get pods -o name -n openshift-controller-manager | xargs -l -r oc logs -n openshift-controller-manager  # make sure leader election has occurred

3) observed more log statements with differing line numbers

4) ran my test script, repro'ed the problem, saw more differing line numbers

I'll try to ask the ART folks today what is up ... it appears to me at least it was moved from MODIFIED to ON_QA incorrectly.

Comment 14 Gabe Montero 2020-04-29 14:44:27 UTC
Yeah it appears the OSE builds for OCM are failing ... see https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=72245
Nothing clean since Apr 27

ART is diving into it for us.

Comment 15 Gabe Montero 2020-04-29 18:15:56 UTC
Actually the lack of 4.5 osbs / brew builds for OCM may date back all the way to Apr 14 and changes related to bumping to golang 1.13

Still actively engaged with ART and Justin Pierce on slack.

A last resort change might be https://github.com/openshift/openshift-controller-manager/pull/98 if the current theories and investigations fall that way,
but ideally those RPMs get included in the golang 1.13 image brew uses.

Comment 16 wewang 2020-04-30 10:52:10 UTC
Good catch @Gabe. Tested in the version below (which includes the bug PRs); PR 98 is not related to this bug, so I am marking it verified.
4.5.0-0.nightly-2020-04-30-042256

steps:
[wewang@wangwen ~]$ sh bz1815816.sh 
build.build.openshift.io/bz1815816-1 created
NAME                        READY   STATUS      RESTARTS   AGE
bz1815816-1-build           0/1     Init:0/2    0          1s
ruby-hello-world-1-build    0/1     Completed   0          147m
ruby-hello-world-1-deploy   0/1     Completed   0          146m
ruby-hello-world-1-hcf75    1/1     Running     0          146m
map[apiVersion:build.openshift.io/v1 controller:true kind:Build name:bz1815816-1 uid:0d3db81a-35ce-4984-8928-8a483dcda7a1]
build.build.openshift.io "bz1815816-1" deleted
build.build.openshift.io/bz1815816-1 created
NAME                 TYPE     FROM          STATUS     STARTED       DURATION
ruby-hello-world-1   Source   Git@57073c0   Complete   2 hours ago   1m30s
bz1815816-1          Source   Git           New                      
bz1815816-1          Source   Git           Pending                  
bz1815816-1          Source   Git           Running    3 seconds ago   
bz1815816-1          Source   Git@57073c0   Running    4 seconds ago   
bz1815816-1          Source   Git@57073c0   Running    About a minute ago   
bz1815816-1          Source   Git@57073c0   Complete   About a minute ago   1m9s

Comment 17 Gabe Montero 2020-04-30 13:10:42 UTC
Awesome, thanks @Wen

And to corroborate, if you look at the report https://openshift-release.svc.ci.openshift.org/releasestream/4.5.0-0.nightly/release/4.5.0-0.nightly-2020-04-30-084834?from=4.5.0-0.nightly-2020-04-27-192810
you'll now see a bunch of our recent OCM changes.

openshift-controller-manager

    Bug 1825324: Use named queues for all controllers #95
    Bug 1815816: more timing windows on build/pod owner ref correlation (allow GC to c… #96
    Bug 1815816: add 2 min retry on create build existing pod wrong owner (allow for GC) #94
    Bug 1801415: Migrate extensions/v1beta1 -> networking.k8s.io/v1beta1 #83
    Bug 1813233: DEVEXP-533 - ocm k8s 118; prometheus bump; *-go bump #85


the changes Justin and the ART folks made did the trick.

Comment 19 errata-xmlrpc 2020-07-13 17:23:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

