Description of problem:
The installation completes successfully and the machine-api CO reports Available as True, but oc describe shows it as degraded/progressing. Also, the machine-api-controllers pod is crash-looping.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-16-084508   True        False         40m     Cluster version is 4.4.0-0.nightly-2020-04-16-084508

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with nightly build 4.4.0-0.nightly-2020-04-16-084508
2. After the install, run:

# oc get co machine-api
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.4.0-0.nightly-2020-04-16-084508   True        False         False      67m

# oc describe co machine-api
Name:         machine-api
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-16T15:46:21Z
  Generation:          1
  Resource Version:    37158
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-api
  UID:                 0c227449-c491-4fe6-9ec7-5bf84aa54895
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-16T15:46:38Z
    Message:               Running resync for operator: 4.4.0-0.nightly-2020-04-16-084508
    Reason:                SyncingResources
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-16T15:46:21Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-16T16:50:27Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-04-16T15:46:21Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:
    Name:       openshift-machine-api
    Resource:   namespaces
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machines
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machinesets
    Group:      rbac.authorization.k8s.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-operator
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-controllers
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       cloud-provider-config-reader
    Namespace:  openshift-config
    Resource:   roles
  Versions:
    Name:     operator
    Version:  4.4.0-0.nightly-2020-04-16-084508
Events:
  Type     Reason           Age                  From                Message
  ----     ------           ----                 ----                -------
  Normal   Status upgrade   68m                  machineapioperator  Progressing towards operator: 4.4.0-0.nightly-2020-04-16-084508
  Warning  Status degraded  4m14s (x6 over 30m)  machineapioperator  deployment machine-api-controllers is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)

# oc get pods -n openshift-machine-api
NAME                                         READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-d8bcfd97f-qlvkf  2/2     Running            0          63m
machine-api-controllers-76b8d649d6-v4v6d     3/4     CrashLoopBackOff   12         68m
machine-api-operator-9fbd675fc-rz5sv         2/2     Running            1          73m

Actual results:
oc get co reports "available".
oc describe reports:
"Normal   Status upgrade   68m                  machineapioperator  Progressing towards operator: 4.4.0-0.nightly-2020-04-16-084508
 Warning  Status degraded  4m14s (x6 over 30m)  machineapioperator  deployment machine-api-controllers is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)"
oc get pods shows that machine-api-controllers is crash-looping.

Expected results:
machine-api should be available and the pods should be in Running status.

Additional info:
Related bug on 4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1812800
Logs from must-gather are here: http://file.rdu.redhat.com/schituku/bug-logs/bug-1824943/must-gather-logs.tar.gz
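The mismatch reported above (the ClusterOperator conditions look healthy while the deployment has no ready replicas) can be checked mechanically. The following is a minimal sketch, assuming the JSON forms of the ClusterOperator and of the machine-api-controllers Deployment are available (for example from oc get ... -o json) and that they carry the usual status.conditions and status.replicas / status.readyReplicas fields. The helper names condition_map and status_mismatch are hypothetical and are not part of oc or the operator.

```python
def condition_map(cluster_operator):
    """Map each condition type to a bool taken from its "True"/"False" status."""
    conds = cluster_operator.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] == "True" for c in conds}


def status_mismatch(cluster_operator, deployment):
    """Return True when the operator looks healthy but the deployment is not ready."""
    conds = condition_map(cluster_operator)
    reports_healthy = conds.get("Available", False) and not conds.get("Degraded", False)
    status = deployment.get("status", {})
    # readyReplicas is typically absent when no replica is ready, so default to 0.
    ready = status.get("readyReplicas", 0)
    wanted = status.get("replicas", 0)
    return reports_healthy and ready < wanted


if __name__ == "__main__":
    # Values trimmed from the output in this report.
    co = {"status": {"conditions": [
        {"type": "Progressing", "status": "False"},
        {"type": "Available", "status": "True"},
        {"type": "Degraded", "status": "False"},
        {"type": "Upgradeable", "status": "True"},
    ]}}
    deploy = {"status": {"replicas": 1, "updatedReplicas": 1, "unavailableReplicas": 1}}
    print(status_mismatch(co, deploy))
```

With the values from this report the check flags a mismatch, which is the symptom being reported: the conditions alone do not reveal the crash-looping controller.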
The root cause of the controller crash is:

2020-04-16T16:50:35.2281397Z I0416 16:50:35.228108 1 publicips.go:57] creating public ip sch-02-4jc7g-sch-02-4jc7g-workload-centralus1-jpzrj
2020-04-16T16:50:35.2282496Z E0416 16:50:35.228208 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

This is fixed in master (4.5): https://bugzilla.redhat.com/show_bug.cgi?id=1809001
And there's a PR for 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1809521

So the operator status is "legitimately" flipping between degraded = false and degraded = true as the controller comes up and then crashes, while available remains true. This is usually fine: once available is true, only a payload upgrade would make the DeploymentRollout fail (degraded = true), and in that case the existing deployment is still operational.

We should try to come up with smarter logic that accounts for the particular scenario in this bz, where the flipping is not good UX, possibly by setting degraded = true and available = false until the controller has been operational for a reasonable timeframe.
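The idea of holding degraded = true and available = false until the controller has stayed up for a while could look roughly like the sketch below. This is only an illustration in Python, not the operator's actual code (the operator is written in Go). The function name compute_status, the STABLE_WINDOW value, and the shape of the inputs are all assumptions; this bug does not specify what a "reasonable timeframe" is.

```python
from datetime import datetime, timedelta

# Hypothetical stabilization window; the bug does not give a value.
STABLE_WINDOW = timedelta(minutes=5)


def compute_status(ready_replicas, desired_replicas, ready_since, now,
                   window=STABLE_WINDOW):
    """Return (available, degraded) for a deployment-backed operator.

    ready_since is the time the deployment last became fully ready,
    or None if it is not fully ready right now.
    """
    fully_ready = desired_replicas > 0 and ready_replicas >= desired_replicas
    if not fully_ready or ready_since is None:
        # Controller is down (for example, crash-looping): report it.
        return (False, True)
    if now - ready_since < window:
        # Ready, but not yet for long enough to be trusted.
        return (False, True)
    return (True, False)


if __name__ == "__main__":
    t0 = datetime(2020, 4, 16, 16, 50, 0)
    # Controller came up 30 seconds ago: still reported as degraded.
    print(compute_status(1, 1, t0, t0 + timedelta(seconds=30)))
    # Controller has been ready for 10 minutes: reported as available.
    print(compute_status(1, 1, t0, t0 + timedelta(minutes=10)))
```

A controller that restarts every few minutes never gets past the window, so the status would stay degraded instead of flipping on each restart. The trade-off is that a healthy cluster reports available only after the window has elapsed.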
*** Bug 1826553 has been marked as a duplicate of this bug. ***
PR is merged [1]; moving to MODIFIED. [1]: https://github.com/openshift/machine-api-operator/pull/561#event-3256381463
Verified
clusterversion: 4.5.0-0.nightly-2020-04-27-204255

$ oc describe co machine-api
Name:         machine-api
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-28T02:43:10Z
  Generation:          1
  Resource Version:    131501
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-api
  UID:                 1ace15bb-8a86-47c5-9156-66a9c1f6109b
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-28T02:56:42Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-04-28T02:56:42Z
    Message:               Cluster Machine API Operator is available at operator: 4.5.0-0.nightly-2020-04-27-204255
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:
    Name:       openshift-machine-api
    Resource:   namespaces
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machines
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machinesets
    Group:      rbac.authorization.k8s.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-operator
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-controllers
    Resource:   clusterroles
  Versions:
    Name:     operator
    Version:  4.5.0-0.nightly-2020-04-27-204255
Events:
  Type    Reason          Age    From                Message
  ----    ------          ----   ----                -------
  Normal  Status upgrade  4h58m  machineapioperator  Progressing towards operator: 4.5.0-0.nightly-2020-04-27-204255

$ oc get po
NAME                                         READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-99c6647f8-7nwc2  2/2     Running   0          4h47m
machine-api-controllers-648449b654-kjhvt     4/4     Running   0          4h43m
machine-api-operator-f6f66d5c7-ktzhr         2/2     Running   0          4h43m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409