Bug 1900538 - [OSP] mapi_instance_create_failed doesn't work on openstack
Summary: [OSP] mapi_instance_create_failed doesn't work on openstack
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.z
Assignee: ShiftStack Bugwatcher
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On: 1890456
Blocks:
Reported: 2020-11-23 10:17 UTC by Milind Yadav
Modified: 2023-03-09 01:00 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1890456
Environment:
Last Closed: 2023-03-09 01:00:15 UTC
Target Upstream Version:
Embargoed:



Description Milind Yadav 2020-11-23 10:17:50 UTC
+++ This bug was initially created as a clone of Bug #1890456 +++

Description of problem:
mapi_instance_create_failed doesn't work on openstack

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create a failed machine by setting the template to an invalid one
2. Check the Prometheus metrics

Actual results:
The Prometheus web console shows "No datapoints found".

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "mapi_instance_"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64475    0 64475    0     0   530k      0 --:--:-- --:--:-- --:--:--  533k
$ oc get machine
NAME                            PHASE     TYPE   REGION   ZONE   AGE
zhsunvs22-tr2bv-master-0        Running                          15h
zhsunvs22-tr2bv-master-1        Running                          15h
zhsunvs22-tr2bv-master-2        Running                          15h
zhsunvs22-tr2bv-worker-5d6xw    Running                          15h
zhsunvs22-tr2bv-worker-xrw84    Running                          15h
zhsunvs22-tr2bv-worker1-sjkss   Failed                           13h

Expected results:
Prometheus should show detailed information for mapi_instance_create_failed.

Comment 4 Pierre Prinetti 2021-04-07 13:13:34 UTC
Reproduced with:

```
$ oc get machineset -n openshift-machine-api -o json \
	| jq '.items[0].spec.template.spec.providerSpec.value.flavor="invalid"' \
	| jq '.items[0].spec.replicas=4' \
	| oc apply -f -

$ sleep 5

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
	curl -sSk -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
	| jq \
	| grep "mapi_instance_"
```

One failed machine on AWS generates a Prometheus value; the same information does not appear on OpenStack's Prometheus.

Raising the severity, as this looks like a pretty sensitive missing piece of OCP's observability.
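
To make the reproduction check above scriptable, the final `grep` can be swapped for a filter that exits non-zero when no matching metric name is exposed. This is a minimal sketch, assuming the endpoint returns the standard Prometheus label-values JSON (`{"status":"success","data":[...]}`); `mapi_metric_names` is a hypothetical helper name, not part of `oc` or Prometheus.

```shell
# Hypothetical helper: reads the JSON body of /api/v1/label/__name__/values
# on stdin and prints each metric name starting with the given prefix
# (default: mapi_instance_), one per line.
# Returns non-zero when no metric name matches.
mapi_metric_names() {
    mapi_prefix="${1:-mapi_instance_}"
    # The leading quote in the pattern anchors the match to the start of a
    # JSON string, so names that merely contain the prefix are not matched.
    mapi_names=$(grep -o "\"${mapi_prefix}[^\"]*\"") || return 1
    printf '%s\n' "$mapi_names" | tr -d '"'
}

# Usage on a cluster (assumption: same curl command as in the reproduction):
#   <curl command> | mapi_metric_names || echo "no mapi_instance_* metrics exposed"
```

On a cluster showing this bug, the helper would be expected to return non-zero on OpenStack even after a machine has entered the Failed phase.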

Comment 5 Pierre Prinetti 2021-04-07 15:30:28 UTC
Marked as "blocker-" because it's not a regression (issue found in 4.6+)

Comment 7 Pierre Prinetti 2021-04-15 16:02:33 UTC
In contrast to CAPA[1], CAPO doesn't seem to be instrumented to report create, update or delete failures to Prometheus.

The team will have to decide whether to introduce the change before the upstream rebase.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/2d4e76faac97d3e4a26d2685d8efd78173bae52e/pkg/actuators/machine/reconciler.go#L77

Comment 8 Pierre Prinetti 2021-04-21 08:56:15 UTC
Since this is not a regression, and there don't seem to be any customer cases attached, we are postponing work on this bug until after we complete the rebase work we are planning for CAPO.

I am restoring the original priority and severity (low/low).

Comment 11 Pierre Prinetti 2021-11-25 16:04:05 UTC
Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 15 Shiftzilla 2023-03-09 01:00:15 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8820

