
Bug 1826553

Summary: Machine API operator should set Degraded=True and Available=False when a controller is crashlooping
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cloud Compute
Assignee: Alberto <agarcial>
Cloud Compute sub component: Other Providers
QA Contact: Milind Yadav <miyadav>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: high
Priority: unspecified
CC: jspeed, vrutkovs
Version: 4.5
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-04-22 10:01:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2020-04-22 00:02:53 UTC
For example, the CI job in [1] has:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1355/artifacts/e2e-aws/pods.json | jq -r '.items[] | select(.metadata.name | contains("machine-api-controllers")).status.containerStatuses[] | select(.name == "machine-controller")'
{
  "containerID": "cri-o://63b92c2593caf3ffb8c6beb682064775d0c27c45074039d0874a9a2de383ab2a",
  "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da927f97c7360b246a12204a11d7075ba4825a05501fd86ff802a9a2c2a63af4",
  "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da927f97c7360b246a12204a11d7075ba4825a05501fd86ff802a9a2c2a63af4",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://63b92c2593caf3ffb8c6beb682064775d0c27c45074039d0874a9a2de383ab2a",
      "exitCode": 2,
      "finishedAt": "2020-04-21T17:09:35Z",
      "reason": "Error",
      "startedAt": "2020-04-21T17:09:35Z"
    }
  },
  "name": "machine-controller",
  "ready": false,
  "restartCount": 10,
  "started": false,
  "state": {
    "waiting": {
      "message": "back-off 5m0s restarting failed container=machine-controller pod=machine-api-controllers-69b5974c7f-d8ns6_openshift-machine-api(c9dde73e-eec5-41dc-a5dc-8b8a2bbef7c5)",
      "reason": "CrashLoopBackOff"
    }
  }
}
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1355/artifacts/e2e-aws/deployments.json | gunzip | jq '.items[] | select(.metadata.name == "machine-api-controllers").status'
{
  "conditions": [
    {
      "lastTransitionTime": "2020-04-21T16:42:26Z",
      "lastUpdateTime": "2020-04-21T16:43:01Z",
      "message": "ReplicaSet \"machine-api-controllers-69b5974c7f\" has successfully progressed.",
      "reason": "NewReplicaSetAvailable",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2020-04-21T16:49:08Z",
      "lastUpdateTime": "2020-04-21T16:49:08Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    }
  ],
  "observedGeneration": 1,
  "replicas": 1,
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1355/artifacts/e2e-aws/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "machine-api").status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .message' | sort
2020-04-21T16:42:26Z Upgradeable True 
2020-04-21T16:47:27Z Available True Cluster Machine API Operator is available at operator: 4.5.0-0.nightly-2020-04-21-123325
2020-04-21T16:47:27Z Degraded False 
2020-04-21T16:47:27Z Progressing False 

The proximate cause of the crashlooping was ART dropping the patch version (more on that in bug 1826265), but regardless of why the pod is crashlooping, the machine-API operator should be complaining about it (Degraded=True, Available=False) and not pretending everything is fine.
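
As a rough illustration of the requested behavior (not the operator's actual code, which is not shown in this report), the check could look something like the Python sketch below. The function names `crashlooping_containers` and `expected_conditions` are hypothetical, and `SAMPLE_POD` is a trimmed copy of the container status from pods.json above. The sketch assumes that a container waiting with reason CrashLoopBackOff is enough to call the operator degraded and unavailable.

```python
from typing import Any, Dict, List

CRASHLOOP_REASON = "CrashLoopBackOff"

# Trimmed copy of the machine-controller status shown in the report.
SAMPLE_POD: Dict[str, Any] = {
    "status": {
        "containerStatuses": [
            {
                "name": "machine-controller",
                "ready": False,
                "restartCount": 10,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            }
        ]
    }
}


def crashlooping_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers currently waiting in CrashLoopBackOff."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == CRASHLOOP_REASON:
            names.append(status["name"])
    return names


def expected_conditions(pods: List[Dict[str, Any]]) -> Dict[str, str]:
    """Hypothetical policy: any crashlooping container means Degraded=True
    and Available=False; otherwise report healthy."""
    failing = [name for pod in pods for name in crashlooping_containers(pod)]
    if failing:
        return {"Available": "False", "Degraded": "True"}
    return {"Available": "True", "Degraded": "False"}


if __name__ == "__main__":
    print(crashlooping_containers(SAMPLE_POD))
    print(expected_conditions([SAMPLE_POD]))
```

In the run above, the ClusterOperator reported Available=True and Degraded=False while this container had already restarted 10 times, which is the mismatch this bug is about.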

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1355

Comment 1 Alberto 2020-04-22 10:01:41 UTC

*** This bug has been marked as a duplicate of bug 1824943 ***