Bug 1608112 - Upgraded 3.11 cluster panics in controller startup for build
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.0
Assigned To: Ben Parees
QA Contact: wewang
Docs Contact:
Depends On:
Blocks:
Reported: 2018-07-24 21:42 EDT by Clayton Coleman
Modified: 2018-10-11 03:22 EDT

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-11 03:22:06 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2652 None None None 2018-10-11 03:22 EDT

Description Clayton Coleman 2018-07-24 21:42:51 EDT
Updated api.ci, got a panic in the build controller

E0725 01:39:57.220107       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:388
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:1020
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:1043
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:366
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:285
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:261
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:246
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361
Comment 1 Clayton Coleman 2018-07-24 21:43:35 EDT
This is with the latest master branch images in origin.
Comment 2 Clayton Coleman 2018-07-24 21:53:20 EDT
The panic is because handleCompletedBuild() can be passed a nil pod if the pod has been deleted already.
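The failure mode described here can be illustrated with a minimal Go sketch. It is not the actual origin code and not the fix that was merged: `Pod` and `ContainerStatus` are simplified, hypothetical stand-ins for the Kubernetes types, and the `Unsafe`/`Safe` helper names and `recoverPanic` are invented for this example. It only shows why reading a field through a nil pod pointer panics and how a nil check avoids it.

```go
package main

import "fmt"

// ContainerStatus and Pod are simplified, hypothetical stand-ins for the
// Kubernetes API types; they are not the real structs.
type ContainerStatus struct {
	TerminatedReason string
}

type Pod struct {
	Name              string
	ContainerStatuses []ContainerStatus
}

// isOOMKilledUnsafe reads a field through pod without checking it, so a nil
// pod (for example, one that has already been deleted) causes a runtime panic.
func isOOMKilledUnsafe(pod *Pod) bool {
	for _, cs := range pod.ContainerStatuses {
		if cs.TerminatedReason == "OOMKilled" {
			return true
		}
	}
	return false
}

// isOOMKilledSafe treats a missing pod as "not OOM killed" instead of
// dereferencing it.
func isOOMKilledSafe(pod *Pod) bool {
	if pod == nil {
		return false
	}
	return isOOMKilledUnsafe(pod)
}

// recoverPanic runs f and converts a panic into an error so the example can
// report it instead of crashing.
func recoverPanic(f func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("observed a panic: %v", r)
		}
	}()
	f()
	return nil
}

func main() {
	var deleted *Pod // the pod was deleted, so the caller only has nil

	fmt.Println("safe result:", isOOMKilledSafe(deleted))
	fmt.Println("unsafe result:", recoverPanic(func() { isOOMKilledUnsafe(deleted) }))
}
```

Whether a nil pod should be treated as "not OOM killed" or handled further up the call chain is a design choice; the sketch picks the former only to keep the example short.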
Comment 4 Clayton Coleman 2018-07-24 21:57:41 EDT
Fix in https://github.com/openshift/origin/pull/20414
Comment 5 wewang 2018-08-23 05:02:21 EDT
Hi @Clayton Coleman, as I understand it, when a build is in a terminal state (succeeded, failed, or cancelled) and the build pod is then deleted, there should be no "Observed a panic" error, according to https://github.com/smarterclayton/origin/blob/5d12941b2ee9da5ce17c9ec296f483f363182dd5/pkg/build/controller/build/build_controller.go#L1012, right?

But I want to reproduce the bug in v3.10. I tried these steps:

1. Create the app
$ oc new-app -f https://raw.githubusercontent.com/openshift/origin/master/examples/sample-app/application-template-dockerbuild.json
$ oc get builds
NAME                  TYPE      FROM          STATUS    STARTED         DURATION
ruby-sample-build-1   Source    Git@7ccd324   Running   7 seconds ago   
2. Cancel the build
$ oc cancel-build ruby-sample-build-1
3. Delete the build pod
$ oc delete pod ruby-sample-build-1-build
4. Check the logs of the bc
$ oc logs -f bc/ruby-sample-build

How can I get the build controller log?
Comment 6 Ben Parees 2018-08-23 11:22:23 EDT
The build controller logs are part of the master logs.

(If you have a separate API server and controller server, they are part of the controller server logs.)
Comment 7 wewang 2018-08-27 05:55:09 EDT
Hi @Ben Parees, I want to reproduce it in OpenShift v3.10.35 with the steps from Comment 5, but I cannot get the error "Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)".
Could you check whether my steps need updating? I think it is hard to time the deletion of the build pod so that it happens before setBuildCompletionData. Thanks so much :>
Comment 8 wewang 2018-08-27 09:35:20 EDT
FYI, please ignore step 4; I checked the build controller log on the controller server.
Comment 9 Ben Parees 2018-08-27 15:32:27 EDT
I think you can recreate this by just starting a build and as soon as the build starts running, deleting the build pod.
Comment 10 wewang 2018-08-28 03:20:02 EDT
@Ben Parees,
a. I know why I cannot reproduce it in OpenShift v3.10.35: there is no isOOMKilled() function in buildconfig_controller.go.

b. v3.11.0-0.10.0 does have isOOMKilled (https://github.com/openshift/ose/blob/v3.11.0-0.10.0/pkg/build/controller/build/build_controller.go#L1020) and does not include PR 20414, but I still cannot reproduce the "Observed a panic" issue. You can check my steps below.

c. Should I update api.ci to test the bug? I am not familiar with that. Or should we just check the latest v3.11 (v3.11.0-0.24.0) with my steps and confirm there is no panic in the build controller?

Here are my steps:
1. Start a build
$ oc new-build https://github.com/openshift/ruby-hello-world

2. While the build is running, delete the build pod
$ oc delete pod ruby-hello-world-1-build

3. Check the build
$ oc get builds
ruby-hello-world-1   Docker    Git@7ccd324   Error (BuildPodDeleted)   15 seconds ago   8s

4. Check the build controller log on the controller server
[root@qe-wewang2-bugcheckmaster-etcd-nfs-1 ~]# oc logs pod/master-controllers-qe-wewang2-bugcheckmaster-etcd-nfs-1 -n kube-system --loglevel=8 | grep -i panic
Comment 12 Ben Parees 2018-08-28 14:34:36 EDT
After deleting the build pod, edit the build object (change an annotation or something). That should force the build through the "handleCompletedBuild" codepath, which will attempt to reference the pod and trigger the nil pointer error.
Comment 13 wewang 2018-08-29 04:36:50 EDT
I was finally able to reproduce the issue in OpenShift v3.11.0-0.10.0 and verified the fix in OpenShift v3.11.0-0.24.0. Thanks, Ben Parees!

Here are my reproduction steps:
1. Create a build
$ oc new-build https://github.com/openshift/ruby-hello-world

2. When the build is complete, edit the build object
$ oc get builds
NAME                 TYPE      FROM          STATUS     STARTED              DURATION
ruby-hello-world-1   Docker    Git@7ccd324   Complete   About a minute ago   1m2s
$ oc edit build ruby-hello-world-1
status:
  completionTimestamp: 2018-08-29T08:19:40Z  # delete completionTimestamp

3. Delete the build pod
$ oc delete pod ruby-hello-world-1-build

4. completionTimestamp should then be recreated automatically in the build object. Delete it manually again to force the build through the "handleCompletedBuild" codepath
$ oc edit build ruby-hello-world-1
status:
  completionTimestamp: 2018-08-29T08:24:34Z  # delete completionTimestamp

5. Check the build controller log on the controller server
$ oc logs pod/master-controllers-qe-wewang2-bugcheckmaster-etcd-nfs-1 -n kube-system --loglevel=8 | grep -A 15 -B 3 pointer

I0829 08:14:37.089836       1 build_controller.go:344] Handling build wen/ruby-hello-world-1 (Complete)
E0829 08:14:37.090007       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/builddir/build/BUILD/atomic-openshift-git-0.766dbc4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
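The log check in the last step can also be done with a small Go program instead of grep. This is only a sketch: the function name `panicContext` and the idea of piping the controller log through it are assumptions made for illustration. It keeps each line containing a marker string plus a few lines before and after it, similar in spirit to the `-B 3 -A 15` options used above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// panicContext returns every line containing marker, together with up to
// `before` preceding and `after` following lines. Overlapping windows are
// merged so that no line is emitted twice.
func panicContext(lines []string, marker string, before, after int) []string {
	var out []string
	next := 0 // index of the first line that has not been emitted yet
	for i, line := range lines {
		if !strings.Contains(line, marker) {
			continue
		}
		start := i - before
		if start < next {
			start = next
		}
		end := i + after + 1
		if end > len(lines) {
			end = len(lines)
		}
		if start < end {
			out = append(out, lines[start:end]...)
			next = end
		}
	}
	return out
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Controller log lines can be long, so allow lines of up to 1 MiB.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)

	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading log:", err)
		os.Exit(1)
	}

	for _, l := range panicContext(lines, "Observed a panic", 3, 15) {
		fmt.Println(l)
	}
}
```

The program reads the log from standard input, so the output of the `oc logs` command in step 5 could be piped into it. It buffers the whole log in memory, which keeps the windowing logic simple but is not suitable for very large logs.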
Comment 15 errata-xmlrpc 2018-10-11 03:22:06 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652
