Bug 1608112 - Upgraded 3.11 cluster panics in controller startup for build
Summary: Upgraded 3.11 cluster panics in controller startup for build
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.11.0
Assignee: Ben Parees
QA Contact: wewang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-07-25 01:42 UTC by Clayton Coleman
Modified: 2018-10-11 07:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:22:06 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2652 0 None None None 2018-10-11 07:22:29 UTC

Description Clayton Coleman 2018-07-25 01:42:51 UTC
Updated api.ci, got a panic in the build controller

E0725 01:39:57.220107       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/asm_amd64.s:573
/usr/local/go/src/runtime/panic.go:502
/usr/local/go/src/runtime/panic.go:63
/usr/local/go/src/runtime/signal_unix.go:388
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:1020
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:1043
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:366
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:285
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:261
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/pkg/build/controller/build/build_controller.go:246
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:2361

Comment 1 Clayton Coleman 2018-07-25 01:43:35 UTC
This happened after updating to the latest master branch images in origin.

Comment 2 Clayton Coleman 2018-07-25 01:53:20 UTC
The panic is because handleCompletedBuild() can be passed a nil pod if the pod has been deleted already.
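
For illustration only, here is a minimal sketch of the kind of nil guard that avoids this class of panic. It assumes simplified, hypothetical Pod and ContainerStatus types rather than the real Kubernetes structs, and it is not the code from the actual fix. The function name isOOMKilled is borrowed from the function referenced in Comment 10.

```go
package main

import "fmt"

// ContainerStatus and Pod are simplified, hypothetical stand-ins for the
// Kubernetes pod types. They are not the real API structs.
type ContainerStatus struct {
	Name             string
	TerminatedReason string // e.g. "OOMKilled"; empty when not terminated
}

type Pod struct {
	Name              string
	Reason            string
	ContainerStatuses []ContainerStatus
}

// isOOMKilled reports whether the pod, or any of its containers, was OOM
// killed. The nil check up front is the point of this sketch: a build whose
// pod has already been deleted reaches this function with pod == nil.
func isOOMKilled(pod *Pod) bool {
	if pod == nil {
		return false
	}
	if pod.Reason == "OOMKilled" {
		return true
	}
	for _, cs := range pod.ContainerStatuses {
		if cs.TerminatedReason == "OOMKilled" {
			return true
		}
	}
	return false
}

func main() {
	// A deleted pod is represented by nil and must not cause a panic.
	fmt.Println(isOOMKilled(nil))
	fmt.Println(isOOMKilled(&Pod{Name: "ruby-hello-world-1-build", Reason: "OOMKilled"}))
}
```

Without the nil check, the first field access on pod would raise the nil pointer dereference seen in the trace above.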

Comment 4 Clayton Coleman 2018-07-25 01:57:41 UTC
Fix in https://github.com/openshift/origin/pull/20414

Comment 5 wewang 2018-08-23 09:02:21 UTC
Hi @Clayton Coleman, as I understand it, when a build is in a terminal state (succeeded, failed, or cancelled) and the build pod is then deleted, there should be no "Observed a panic" error, according to https://github.com/smarterclayton/origin/blob/5d12941b2ee9da5ce17c9ec296f483f363182dd5/pkg/build/controller/build/build_controller.go#L1012, right?

I want to reproduce the bug in v3.10 and tried the following steps:

1. Create the app
$ oc new-app -f https://raw.githubusercontent.com/openshift/origin/master/examples/sample-app/application-template-dockerbuild.json
$ oc get builds
NAME                  TYPE      FROM          STATUS    STARTED         DURATION
ruby-sample-build-1   Source    Git@7ccd324   Running   7 seconds ago
2. Cancel the build
$ oc cancel-build ruby-sample-build-1
3. Delete the build pod
$ oc delete pod ruby-sample-build-1-build
4. Check the logs of the bc
$ oc logs -f bc/ruby-sample-build

How can I get the logs for the build controller?

Comment 6 Ben Parees 2018-08-23 15:22:23 UTC
The build controller logs are part of the master logs.

(If you have a separate API server and controller server, then they are part of the controller server logs.)

Comment 7 wewang 2018-08-27 09:55:09 UTC
Hi @Ben Parees, I want to reproduce it in OpenShift v3.10.35 with the steps in Comment 5,
but I cannot get the error "Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)".
Could you check whether my steps need to be updated? I think it is hard to time the deletion of the build pod so that it happens before setBuildCompletionData. Thanks so much :>

Comment 8 wewang 2018-08-27 13:35:20 UTC
FYI, please ignore step 4. I checked the build controller log on the controller server.

Comment 9 Ben Parees 2018-08-27 19:32:27 UTC
I think you can recreate this by just starting a build and as soon as the build starts running, deleting the build pod.

Comment 10 wewang 2018-08-28 07:20:02 UTC
@Ben Parees,
a. I now know why I cannot reproduce it in OpenShift v3.10.35: there is no isOOMKilled() function in buildconfig_controller.go.

b. v3.11.0-0.10.0, however, does have isOOMKilled (https://github.com/openshift/ose/blob/v3.11.0-0.10.0/pkg/build/controller/build/build_controller.go#L1020) and does not include PR 20414, yet I still cannot reproduce the "Observed a panic" issue. You can check my steps below.

c. Should I update api.ci to test the bug? I am not familiar with that.
Or should we just check the latest v3.11 (v3.11.0-0.24.0) with my steps and confirm there is no panic in the build controller?

Here are my steps:
1. Start a build
$ oc new-build https://github.com/openshift/ruby-hello-world

2. While the build is running, delete the build pod
$ oc delete pod ruby-hello-world-1-build

3. Check the build
$ oc get builds
ruby-hello-world-1   Docker    Git@7ccd324   Error (BuildPodDeleted)   15 seconds ago   8s

4. Check the build controller log on the controller server
[root@qe-wewang2-bugcheckmaster-etcd-nfs-1 ~]# oc logs pod/master-controllers-qe-wewang2-bugcheckmaster-etcd-nfs-1 -n kube-system --loglevel=8 | grep -i panic

Comment 12 Ben Parees 2018-08-28 18:34:36 UTC
After deleting the build pod, edit the build object (change an annotation or something). That should force the build through the "handleCompletedBuild" codepath, which will attempt to reference the pod and trigger the nil pointer error.
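
A rough sketch of why this sequence triggers the error, under stated assumptions: the Pod, Build, and podStore types and both handler functions below are hypothetical simplifications, not the real controller code. It only models the idea that a re-processed completed build looks up its pod, gets nil once the pod is gone, and then dereferences it.

```go
package main

import "fmt"

// Pod and Build are hypothetical, heavily simplified stand-ins.
type Pod struct{ Phase string }

type Build struct {
	Name  string
	Phase string
}

// podStore is a hypothetical in-memory stand-in for the controller's pod cache.
type podStore map[string]*Pod

// handleCompletedBuildUnguarded mimics a handler that dereferences the pod
// without checking for nil first.
func handleCompletedBuildUnguarded(b *Build, pod *Pod) string {
	return pod.Phase // panics when pod is nil
}

// handleBuild models one pass of the controller over a build. It returns the
// recovered panic message, or "" when no panic occurred.
func handleBuild(b *Build, pods podStore) (panicMsg string) {
	defer func() {
		if r := recover(); r != nil {
			panicMsg = fmt.Sprint(r)
		}
	}()
	pod := pods[b.Name+"-build"] // nil once the pod has been deleted
	if b.Phase == "Complete" {
		handleCompletedBuildUnguarded(b, pod)
	}
	return ""
}

func main() {
	pods := podStore{"ruby-hello-world-1-build": &Pod{Phase: "Succeeded"}}
	b := &Build{Name: "ruby-hello-world-1", Phase: "Complete"}

	// Pod still exists: the completed build is handled without a panic.
	fmt.Printf("%q\n", handleBuild(b, pods))

	// Pod deleted, then the build is processed again (as an edit would cause).
	delete(pods, "ruby-hello-world-1-build")
	fmt.Printf("%q\n", handleBuild(b, pods))
}
```

The second call returns the same runtime error text that appears in the controller log, because the lookup yields nil and the handler dereferences it.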

Comment 13 wewang 2018-08-29 08:36:50 UTC
I could finally reproduce the issue in OpenShift v3.11.0-0.10.0 and verified the fix in OpenShift v3.11.0-0.24.0. Thanks, Ben Parees!

Here are my reproduction steps:
1. Create a build
$ oc new-build https://github.com/openshift/ruby-hello-world

2. When the build is complete, edit the build object
$ oc get builds
NAME                 TYPE      FROM          STATUS     STARTED              DURATION
ruby-hello-world-1   Docker    Git@7ccd324   Complete   About a minute ago   1m2s
$ oc edit build ruby-hello-world-1
status:
  completionTimestamp: 2018-08-29T08:19:40Z  # delete completionTimestamp

3. Delete the build pod
$ oc delete pod ruby-hello-world-1-build

4. completionTimestamp should then be recreated automatically in the build object. Delete it manually again to force the build through the "handleCompletedBuild" codepath
$ oc edit build ruby-hello-world-1
status:
  completionTimestamp: 2018-08-29T08:24:34Z  # delete completionTimestamp

5. Check the build controller log on the controller server
$ oc logs pod/master-controllers-qe-wewang2-bugcheckmaster-etcd-nfs-1 -n kube-system --loglevel=8 | grep -A 15 -B 3 pointer

I0829 08:14:37.089836       1 build_controller.go:344] Handling build wen/ruby-hello-world-1 (Complete)
E0829 08:14:37.090007       1 runtime.go:66] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/builddir/build/BUILD/atomic-openshift-git-0.766dbc4/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72

Comment 15 errata-xmlrpc 2018-10-11 07:22:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

