+++ This bug was initially created as a clone of Bug #1890456 +++

Description of problem:
mapi_instance_create_failed doesn't work on OpenStack.

Version-Release number of selected component (if applicable):
4.6.0-rc.4

How reproducible:
Always

Steps to Reproduce:
1. Create a failed machine by setting the template to an invalid one
2. Check the Prometheus metrics

Actual results:
The Prometheus web console shows "No datapoints found".

$ token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' | jq | grep "mapi_instance_"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 64475    0 64475    0     0   530k      0 --:--:-- --:--:-- --:--:--  533k

$ oc get machine
NAME                            PHASE     TYPE   REGION   ZONE   AGE
zhsunvs22-tr2bv-master-0        Running                          15h
zhsunvs22-tr2bv-master-1        Running                          15h
zhsunvs22-tr2bv-master-2        Running                          15h
zhsunvs22-tr2bv-worker-5d6xw    Running                          15h
zhsunvs22-tr2bv-worker-xrw84    Running                          15h
zhsunvs22-tr2bv-worker1-sjkss   Failed                           13h

Expected results:
Should show mapi_instance_create_failed detail info.
Reproduced with:

```
$ oc get machineset -n openshift-machine-api -o json \
    | jq '.items[0].spec.template.spec.providerSpec.value.flavor="invalid"' \
    | jq '.items[0].spec.replicas=4' \
    | oc apply -f -
$ sleep 5
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sSk -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
    | jq \
    | grep "mapi_instance_"
```

One failed machine on AWS generates a Prometheus value; the same information does not appear on OpenStack's Prometheus.

Raising the severity, as this looks like a pretty sensitive missing piece of OCP's observability.
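For automating the check above, the final `jq | grep` step could be wrapped in a small helper that fails when no matching metric is exposed. This is only a sketch: the function name `list_metrics_with_prefix` is hypothetical and not part of any OpenShift tooling, and it assumes the input is the JSON body returned by the `/api/v1/label/__name__/values` endpoint used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not existing tooling): reads the JSON
# response of Prometheus' label-values endpoint on stdin and prints every
# metric name that starts with the given prefix, one per line.
# Returns non-zero when nothing matches, so it can gate a test step.
list_metrics_with_prefix() {
    prefix="$1"
    # Pull out each quoted name beginning with the prefix, then drop the quotes.
    matches=$(grep -o "\"${prefix}[A-Za-z0-9_:]*\"" | tr -d '"')
    # An empty result means the metric family is not exposed at all.
    [ -n "$matches" ] || return 1
    printf '%s\n' "$matches"
}

# Intended use (needs a live cluster, so it is left commented out):
#   oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
#       curl -sSk -H "Authorization: Bearer $token" \
#       'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/label/__name__/values' \
#       | list_metrics_with_prefix mapi_instance_
```

On a cluster affected by this bug the helper would be expected to return non-zero, matching the empty `grep` output shown in the description.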
Marked as "blocker-" because it's not a regression (issue found in 4.6+)
In contrast to CAPA[1], CAPO doesn't seem to be instrumented to report create, update or delete failures to Prometheus.

The team will have to decide whether to introduce the change before the upstream rebase.

[1]: https://github.com/openshift/cluster-api-provider-aws/blob/2d4e76faac97d3e4a26d2685d8efd78173bae52e/pkg/actuators/machine/reconciler.go#L77
Since this is not a regression, and there don't seem to be any customer cases attached, we are postponing work on this bug until after we complete the rebase work we are planning for CAPO.

I am restoring the original priority and severity (low/low).
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-8820