Bug 1824943
Summary: machine-api status available but describe shows degraded as machine-api-controllers pod is not available
Product: OpenShift Container Platform
Component: Cloud Compute
Cloud Compute sub component: Other Providers
Reporter: Siva Reddy <schituku>
Assignee: Alberto <agarcial>
QA Contact: sunzhaohua <zhsun>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: mifiedle, schituku, wking
Version: 4.4
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Last Closed: 2020-07-13 17:27:59 UTC
Type: Bug
Description
Siva Reddy
2020-04-16 17:29:14 UTC
The root cause making the controller break is:

2020-04-16T16:50:35.2281397Z I0416 16:50:35.228108 1 publicips.go:57] creating public ip sch-02-4jc7g-sch-02-4jc7g-workload-centralus1-jpzrj
2020-04-16T16:50:35.2282496Z E0416 16:50:35.228208 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

This is fixed in master (4.5): https://bugzilla.redhat.com/show_bug.cgi?id=1809001
And there's a PR for 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1809521

So the operator status is "legitimately" flipping between degraded = false / true as the controller comes up and then breaks, while available remains true. This is usually fine: once available is true, only a payload upgrade would make the DeploymentRollout fail (degraded true) while the existing one is still operational.

We should try to come up with some smarter logic that accounts for this bz's particular scenario, where the flipping is not actually good UX, and possibly set degraded = true and available = false until the controller has been operational for a reasonable timeframe.

*** Bug 1826553 has been marked as a duplicate of this bug. ***

PR is merged [1]; moving to MODIFIED.
[1]: https://github.com/openshift/machine-api-operator/pull/561#event-3256381463

Verified
clusterversion: 4.5.0-0.nightly-2020-04-27-204255

$ oc describe co machine-api
Name:         machine-api
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-28T02:43:10Z
  Generation:          1
  Resource Version:    131501
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-api
  UID:                 1ace15bb-8a86-47c5-9156-66a9c1f6109b
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-28T02:56:42Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-04-28T02:56:42Z
    Message:               Cluster Machine API Operator is available at operator: 4.5.0-0.nightly-2020-04-27-204255
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:
    Name:       openshift-machine-api
    Resource:   namespaces
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machines
    Group:      machine.openshift.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   machinesets
    Group:      rbac.authorization.k8s.io
    Name:
    Namespace:  openshift-machine-api
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-operator
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-controllers
    Resource:   clusterroles
  Versions:
    Name:     operator
    Version:  4.5.0-0.nightly-2020-04-27-204255
Events:
  Type    Reason          Age    From                Message
  ----    ------          ----   ----                -------
  Normal  Status upgrade  4h58m  machineapioperator  Progressing towards operator: 4.5.0-0.nightly-2020-04-27-204255

$ oc get po
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-99c6647f8-7nwc2   2/2     Running   0          4h47m
machine-api-controllers-648449b654-kjhvt      4/4     Running   0          4h43m
machine-api-operator-f6f66d5c7-ktzhr          2/2     Running   0          4h43m

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409