Bug 1747246

Summary: [osp] machine-api-controller pod stuck in CrashLoopBackOff
Product: OpenShift Container Platform
Reporter: sunzhaohua <zhsun>
Component: Cloud Compute
Assignee: Andrew McDermott <amcdermo>
Status: CLOSED ERRATA
QA Contact: sunzhaohua <zhsun>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.2.0
CC: agarcial, jhou
Target Milestone: ---
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: osp
Last Closed: 2019-10-16 06:38:50 UTC
Type: Bug

Description sunzhaohua 2019-08-30 01:10:48 UTC
Description of problem:
After a cluster has been running for a while, the machine-api-controllers pod gets stuck in CrashLoopBackOff. This has been seen on 2 clusters.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-14-211610


How reproducible:
Sometimes

Steps to Reproduce:
1. Keep a cluster running for a while
2. Check the machine-api-controllers pod


Actual results:
The machine-api-controllers pod is stuck in CrashLoopBackOff; its nodelink-controller container keeps restarting.

$ oc get pod
NAME                                          READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-fff44d57f-vrgcp   1/1     Running            0          17h
machine-api-controllers-5fb9f8f668-gw6xt      3/4     CrashLoopBackOff   46         17h
machine-api-operator-6f89c74764-khxbr         1/1     Running            1          17h

$ oc edit pod machine-api-controllers-5fb9f8f668-gw6xt
    name: nodelink-controller
    ready: false
    restartCount: 47
    state:
      waiting:
        message: Back-off 5m0s restarting failed container=nodelink-controller pod=machine-api-controllers-5fb9f8f668-gw6xt_openshift-machine-api(3d500681-c9ab-11e9-ad00-fa163ea99686)
        reason: CrashLoopBackOff


$ oc logs -f machine-api-controllers-5fb9f8f668-gw6xt -c nodelink-controller
I0829 08:47:03.348079       1 nodelink_controller.go:92] Adding internal IP "192.168.0.34" for node "preserve-groupg-4cf4r-master-1" to indexer
I0829 08:47:03.396320       1 nodelink_controller.go:188] Reconciling Node /preserve-groupg-4cf4r-worker-f6ffp
I0829 08:47:03.396606       1 nodelink_controller.go:409] Finding machine from node "preserve-groupg-4cf4r-worker-f6ffp"
I0829 08:47:03.396659       1 nodelink_controller.go:426] Finding machine from node "preserve-groupg-4cf4r-worker-f6ffp" by ProviderID
I0829 08:47:03.396721       1 nodelink_controller.go:449] Finding machine from node "preserve-groupg-4cf4r-worker-f6ffp" by IP
I0829 08:47:03.396764       1 nodelink_controller.go:454] Found internal IP for node "preserve-groupg-4cf4r-worker-f6ffp": "192.168.0.26"
E0829 08:47:03.397032       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:522
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:135
/go/src/github.com/openshift/machine-api-operator/pkg/controller/nodelink/nodelink_controller.go:205
/go/src/github.com/openshift/machine-api-operator/pkg/controller/nodelink/nodelink_controller.go:205
/go/src/github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210
/go/src/github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/asm_amd64.s:1333
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x10753f7]

goroutine 438 [running]:
github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x108
panic(0x11c6a80, 0x1fee7d0)
        /opt/rh/go-toolset-1.11/root/usr/lib/go-toolset-1.11-golang/src/runtime/panic.go:513 +0x1b9
github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1.(*ObjectMeta).GetName(...)
        /go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/apis/meta/v1/meta.go:135
github.com/openshift/machine-api-operator/pkg/controller/nodelink.(*ReconcileNodeLink).findMachineFromNode(0xc000443230, 0xc0003d6580, 0xc0000c6008, 0x0, 0x0)
        /go/src/github.com/openshift/machine-api-operator/pkg/controller/nodelink/nodelink_controller.go:420 +0x267
github.com/openshift/machine-api-operator/pkg/controller/nodelink.(*ReconcileNodeLink).Reconcile(0xc000443230, 0x0, 0x0, 0xc000044e40, 0x22, 0x2001c40, 0xc0001ced50, 0x444e37, 0x8)
        /go/src/github.com/openshift/machine-api-operator/pkg/controller/nodelink/nodelink_controller.go:205 +0x324
github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000ce1e0, 0x0)
        /go/src/github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:210 +0x17d
github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
        /go/src/github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000514ac0)
        /go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000514ac0, 0x3b9aca00, 0x0, 0x13b2201, 0xc0003d8120)
        /go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xbe
github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000514ac0, 0x3b9aca00, 0xc0003d8120)
        /go/src/github.com/openshift/machine-api-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
        /go/src/github.com/openshift/machine-api-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x32a

Expected results:
The machine-api-controllers pod works normally.

Additional info:

Comment 4 sunzhaohua 2019-09-03 09:17:35 UTC
Verified.

clusterversion: 4.2.0-0.nightly-2019-09-02-172410

Did not hit this issue again; marking as verified.

Comment 5 errata-xmlrpc 2019-10-16 06:38:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922