Bug 1824943 - machine-api status available but describe shows degraded as machine-api-controllers pod is not available
Summary: machine-api status available but describe shows degraded as machine-api-contr...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Alberto
QA Contact: sunzhaohua
URL:
Whiteboard:
Duplicates: 1826553
Depends On:
Blocks:
Reported: 2020-04-16 17:29 UTC by Siva Reddy
Modified: 2020-07-13 17:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:27:59 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-api-operator pull 561 | 0 | None | closed | Bug 1824943: check minimum available time in waitForDeploymentRollout | 2021-02-17 04:04:59 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:28:23 UTC

Description Siva Reddy 2020-04-16 17:29:14 UTC
Description of problem:
    The installation completes successfully and the machine-api CO reports Available as True,
but oc describe shows degraded/progressing events. The machine-api-controllers pod is also crash-looping.


Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-04-16-084508   True        False         40m     Cluster version is 4.4.0-0.nightly-2020-04-16-084508

How reproducible:
Always

Steps to Reproduce:
1. Install a cluster with nightly build - 4.4.0-0.nightly-2020-04-16-084508
2. After the install, do 
# oc get co machine-api
NAME          VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-api   4.4.0-0.nightly-2020-04-16-084508   True        False         False      67m

# oc describe co machine-api
Name:         machine-api
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-16T15:46:21Z
  Generation:          1
  Resource Version:    37158
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-api
  UID:                 0c227449-c491-4fe6-9ec7-5bf84aa54895
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-16T15:46:38Z
    Message:               Running resync for operator: 4.4.0-0.nightly-2020-04-16-084508
    Reason:                SyncingResources
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-16T15:46:21Z
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-16T16:50:27Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-04-16T15:46:21Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      
    Name:       openshift-machine-api
    Resource:   namespaces
    Group:      machine.openshift.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   machines
    Group:      machine.openshift.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   machinesets
    Group:      rbac.authorization.k8s.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-operator
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-controllers
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       cloud-provider-config-reader
    Namespace:  openshift-config
    Resource:   roles
  Versions:
    Name:     operator
    Version:  4.4.0-0.nightly-2020-04-16-084508
Events:
  Type     Reason           Age                  From                Message
  ----     ------           ----                 ----                -------
  Normal   Status upgrade   68m                  machineapioperator  Progressing towards operator: 4.4.0-0.nightly-2020-04-16-084508
  Warning  Status degraded  4m14s (x6 over 30m)  machineapioperator  deployment machine-api-controllers is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)

# oc get pods -n openshift-machine-api
NAME                                          READY   STATUS             RESTARTS   AGE
cluster-autoscaler-operator-d8bcfd97f-qlvkf   2/2     Running            0          63m
machine-api-controllers-76b8d649d6-v4v6d      3/4     CrashLoopBackOff   12         68m
machine-api-operator-9fbd675fc-rz5sv          2/2     Running            1          73m

Actual results:
 oc get co reports "available"
 oc describe reports "Normal   Status upgrade   68m                  machineapioperator  Progressing towards operator: 4.4.0-0.nightly-2020-04-16-084508
  Warning  Status degraded  4m14s (x6 over 30m)  machineapioperator  deployment machine-api-controllers is not ready. status: (replicas: 1, updated: 1, ready: 0, unavailable: 1)"

 oc get pods shows that machine-api-controllers is crash-looping

Expected results:
machine-api should be available and pods should be in Running Status
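
The mismatch between the actual and expected results can be checked mechanically. The following is only a sketch, not part of the operator: it assumes the ClusterOperator has been dumped with `oc get co machine-api -o json`, and the function names (`condition_statuses`, `reports_healthy`, `is_inconsistent`) are hypothetical. The sample conditions and the replica counts (replicas: 1, ready: 0) are copied from the output in this report.

```python
import json

# Conditions copied from the `oc describe co machine-api` output in this
# report, arranged in the ClusterOperator JSON shape (status.conditions).
SAMPLE_CO = json.dumps({
    "status": {"conditions": [
        {"type": "Progressing", "status": "False"},
        {"type": "Available", "status": "True"},
        {"type": "Degraded", "status": "False"},
        {"type": "Upgradeable", "status": "True"},
    ]}
})


def condition_statuses(co_json: str) -> dict:
    """Map each condition type to a bool (hypothetical helper)."""
    obj = json.loads(co_json)
    conditions = obj.get("status", {}).get("conditions", [])
    return {c["type"]: c["status"] == "True" for c in conditions}


def reports_healthy(statuses: dict) -> bool:
    """True when the operator claims Available, not Degraded, not Progressing."""
    return (statuses.get("Available", False)
            and not statuses.get("Degraded", True)
            and not statuses.get("Progressing", True))


def is_inconsistent(statuses: dict, replicas: int, ready: int) -> bool:
    """True when the operator looks healthy but its deployment is not ready."""
    return reports_healthy(statuses) and ready < replicas


statuses = condition_statuses(SAMPLE_CO)
# Replica counts taken from the "Status degraded" event in this report.
print(is_inconsistent(statuses, replicas=1, ready=0))  # prints True
```

With the values from this report the check returns True, which is the symptom being reported: the conditions look healthy while the deployment has no ready replica.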

Additional info:

related bug on 4.5 - https://bugzilla.redhat.com/show_bug.cgi?id=1812800
Logs from must gather are here: http://file.rdu.redhat.com/schituku/bug-logs/bug-1824943/must-gather-logs.tar.gz

Comment 1 Alberto 2020-04-17 10:07:55 UTC
The root cause of the controller crash is this panic:

2020-04-16T16:50:35.2281397Z I0416 16:50:35.228108       1 publicips.go:57] creating public ip sch-02-4jc7g-sch-02-4jc7g-workload-centralus1-jpzrj
2020-04-16T16:50:35.2282496Z E0416 16:50:35.228208       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

This is fixed in master (4.5): https://bugzilla.redhat.com/show_bug.cgi?id=1809001
And there's a PR for 4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1809521

So the operator status is "legitimately" flipping between degraded = false and degraded = true as the controller comes up and then breaks, while available remains true. This is usually fine: once available is true, only a payload upgrade would make the DeploymentRollout fail (degraded = true), and the existing deployment would still be operational in the meantime.

We should try to come up with smarter logic that accounts for the particular scenario in this bz, where the flipping is not good UX, and possibly set degraded = true and available = false until the controller has been operational for a reasonable timeframe.
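
The idea of holding the status until the controller has been operational for a reasonable timeframe could be sketched roughly as follows. This is an illustration only, not the code from the linked pull request: `DeploymentSnapshot`, `is_rolled_out`, and the three-minute threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DeploymentSnapshot:
    """Hypothetical, simplified view of a Deployment's status."""
    replicas: int
    updated: int
    available: int
    # When the deployment last became fully available; None if it is not.
    available_since: Optional[datetime]


def is_rolled_out(snap: DeploymentSnapshot, now: datetime,
                  min_available: timedelta = timedelta(minutes=3)) -> bool:
    """Treat a rollout as done only after sustained availability.

    The 3-minute default is an assumed value chosen for illustration.
    """
    if snap.updated < snap.replicas or snap.available < snap.replicas:
        return False
    if snap.available_since is None:
        return False
    # A pod that comes up and crashes shortly after never passes this check.
    return now - snap.available_since >= min_available


now = datetime(2020, 4, 16, 17, 0, tzinfo=timezone.utc)
crashing = DeploymentSnapshot(replicas=1, updated=1, available=0, available_since=None)
flapping = DeploymentSnapshot(1, 1, 1, available_since=now - timedelta(seconds=20))
stable = DeploymentSnapshot(1, 1, 1, available_since=now - timedelta(minutes=10))
print(is_rolled_out(crashing, now), is_rolled_out(flapping, now), is_rolled_out(stable, now))
```

Under this sketch a crash-looping controller that is briefly available between restarts would keep the operator at degraded = true / available = false, instead of flipping on every restart.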

Comment 2 Alberto 2020-04-22 10:01:41 UTC
*** Bug 1826553 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2020-04-22 23:54:42 UTC
PR is merged [1]; moving to MODIFIED.

[1]: https://github.com/openshift/machine-api-operator/pull/561#event-3256381463

Comment 6 sunzhaohua 2020-04-28 08:04:39 UTC
Verified
clusterversion: 4.5.0-0.nightly-2020-04-27-204255

$ oc describe co machine-api
Name:         machine-api
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-04-28T02:43:10Z
  Generation:          1
  Resource Version:    131501
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-api
  UID:                 1ace15bb-8a86-47c5-9156-66a9c1f6109b
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-04-28T02:56:42Z
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2020-04-28T02:56:42Z
    Message:               Cluster Machine API Operator is available at operator: 4.5.0-0.nightly-2020-04-27-204255
    Status:                True
    Type:                  Available
    Last Transition Time:  2020-04-28T02:53:22Z
    Status:                True
    Type:                  Upgradeable
  Extension:               <nil>
  Related Objects:
    Group:      
    Name:       openshift-machine-api
    Resource:   namespaces
    Group:      machine.openshift.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   machines
    Group:      machine.openshift.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   machinesets
    Group:      rbac.authorization.k8s.io
    Name:       
    Namespace:  openshift-machine-api
    Resource:   roles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-operator
    Resource:   clusterroles
    Group:      rbac.authorization.k8s.io
    Name:       machine-api-controllers
    Resource:   clusterroles
  Versions:
    Name:     operator
    Version:  4.5.0-0.nightly-2020-04-27-204255
Events:
  Type    Reason          Age    From                Message
  ----    ------          ----   ----                -------
  Normal  Status upgrade  4h58m  machineapioperator  Progressing towards operator: 4.5.0-0.nightly-2020-04-27-204255


$ oc get po
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-99c6647f8-7nwc2   2/2     Running   0          4h47m
machine-api-controllers-648449b654-kjhvt      4/4     Running   0          4h43m
machine-api-operator-f6f66d5c7-ktzhr          2/2     Running   0          4h43m

Comment 7 errata-xmlrpc 2020-07-13 17:27:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

