Bug 1900538

Summary:	[OSP] mapi_instance_create_failed doesn't work on openstack
Product:	OpenShift Container Platform	Reporter:	Milind Yadav <miyadav>
Component:	Cloud Compute	Assignee:	ShiftStack Bugwatcher <shiftstack-bugwatcher>
Cloud Compute sub component:	OpenStack Provider	QA Contact:	Jon Uriarte <juriarte>
Status:	CLOSED DEFERRED	Docs Contact:
Severity:	low
Priority:	low	CC:	eduen, m.andre, mbooth, pprinett, zhsun
Version:	4.6	Keywords:	Triaged
Target Milestone:	---
Target Release:	4.8.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1890456	Environment:
Last Closed:	2023-03-09 01:00:15 UTC	Type:	---
Bug Depends On:	1890456
Bug Blocks:

Description Milind Yadav 2020-11-23 10:17:50 UTC

+++ This bug was initially created as a clone of Bug #1890456 +++

Description of problem:
mapi_instance_create_failed doesn't work on openstack

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create a failed machine by setting the template to an invalid one.
2. Check the Prometheus metrics.

Actual results:
The Prometheus web console shows "No datapoints found", and grepping the metric names for "mapi_instance_" returns nothing:

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "mapi_instance_"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64475    0 64475    0     0   530k      0 --:--:-- --:--:-- --:--:--  533k
$ oc get machine
NAME                            PHASE     TYPE   REGION   ZONE   AGE
zhsunvs22-tr2bv-master-0        Running                          15h
zhsunvs22-tr2bv-master-1        Running                          15h
zhsunvs22-tr2bv-master-2        Running                          15h
zhsunvs22-tr2bv-worker-5d6xw    Running                          15h
zhsunvs22-tr2bv-worker-xrw84    Running                          15h
zhsunvs22-tr2bv-worker1-sjkss   Failed                           13h

Expected results:
Prometheus should expose the mapi_instance_create_failed metric with details about the failed machine.

Comment 4 Pierre Prinetti 2021-04-07 13:13:34 UTC

Reproduced with:

```
$ oc get machineset -n openshift-machine-api -o json \
	| jq '.items[0].spec.template.spec.providerSpec.value.flavor="invalid"' \
	| jq '.items[0].spec.replicas=4' \
	| oc apply -f -

$ sleep 5

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
	curl -sSk -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
	| jq \
	| grep "mapi_instance_"
```

One failed machine on AWS generates a Prometheus value; the same information does not appear on OpenStack's Prometheus.
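
For anyone who wants to script the check above rather than grep by eye, here is a minimal sketch. It assumes the same endpoint, pod and service account as the commands in this comment. The helper names `list_metric_names` and `has_metric` are hypothetical and not part of `oc` or Prometheus.

```shell
#!/bin/sh
# Sketch only: list_metric_names and has_metric are hypothetical helper names.

# Print the JSON list of metric names from the in-cluster Prometheus.
# Same endpoint, pod and service account as the reproduction commands above.
list_metric_names() {
	token=$(oc sa get-token prometheus-k8s -n openshift-monitoring)
	oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
		curl -sSk -H "Authorization: Bearer $token" \
		'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values'
}

# Read that JSON on stdin; exit 0 only if the exact, quoted metric name "$1" is listed.
has_metric() {
	grep -qF "\"$1\""
}

# Usage on a live cluster (not executed here):
#   list_metric_names | has_metric mapi_instance_create_failed \
#       && echo "metric present" || echo "metric missing"
```

On AWS the usage line would be expected to report the metric as present once a machine has failed; on OpenStack, per this bug, it would report it as missing.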

Raising the severity, as this looks like a pretty sensitive missing piece of OCP's observability.

Comment 5 Pierre Prinetti 2021-04-07 15:30:28 UTC

Marked as "blocker-" because it's not a regression (issue found in 4.6+)

Comment 7 Pierre Prinetti 2021-04-15 16:02:33 UTC

In contrast to CAPA[1], CAPO doesn't seem to be instrumented to report create, update or delete failures to Prometheus.

The team will have to decide whether to introduce the change before the upstream rebase.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/2d4e76faac97d3e4a26d2685d8efd78173bae52e/pkg/actuators/machine/reconciler.go#L77

Comment 8 Pierre Prinetti 2021-04-21 08:56:15 UTC

Since this is not a regression, and there don't seem to be any customer cases attached, we are postponing work on this bug until after we complete the rebase work we are planning for CAPO.

I am restoring the original priority and severity (low/low).

Comment 11 Pierre Prinetti 2021-11-25 16:04:05 UTC

Removing the Triaged keyword because:

* the QE automation assessment (flag qe_test_coverage) is missing

Comment 15 Shiftzilla 2023-03-09 01:00:15 UTC

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-8820