Created attachment 1675776 [details]
after power off

Description of problem:
When checking the metal3 metrics on a fresh cluster, these metrics are missing from the output:

metal3_host_error_total
metal3_operation_power_change_total
metal3_host_config_data_error_total
metal3_operation_deprovision_duration_seconds

Some of them first appeared only after one of the workers was powered off and back on. Logs from before and after are attached.

Version-Release number of selected component (if applicable):
OpenShift: 4.4.0-0.nightly-2020-04-01-005209

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and observe the metal3 metrics from the CLI:
   curl http://localhost:8085/metrics
2. A number of metal3 metrics are present, found in the metal3 pod, metal3-baremetal-operator:
   metal3_reconcile_error_total
   metal3_credentials_missing_total
   metal3_credentials_invalid_total
   metal3_credentials_unhandled_error_total
   metal3_credentials_updated_total
   metal3_credentials_no_management_access_total
   metal3_operation_register_duration_seconds
   metal3_operation_inspect_duration_seconds
   metal3_operation_provision_duration_seconds
   metal3_provisioning_state_change_total
   metal3_host_registration_required_total
   metal3_delete_without_deprovisioning_total
3. According to the source code, these are also supposed to appear:
   metal3_host_error_total
   metal3_operation_power_change_total
   metal3_host_config_data_error_total
   metal3_operation_deprovision_duration_seconds
4. In the UI, turn one of the workers off and back on. Then deprovision it.

Actual results:
metal3_operation_power_change_total and metal3_operation_deprovision_duration_seconds appear only after step 4.

Expected results:
The metrics appear from the beginning.

Additional info:
Logs attached.
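The before/after comparison in the steps above can be automated. The following is a minimal Go sketch, assuming the curl output has been saved as text in the Prometheus exposition format. The helper name missingMetrics and the sample input are hypothetical, and the suffix handling assumes the duration metrics are exported as histograms (with _bucket, _sum and _count series).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics returns the names in want that have no sample line in the
// given exposition text. Comment lines (# HELP / # TYPE) are ignored.
func missingMetrics(exposition string, want []string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		name := line[:end]
		seen[name] = true
		// Assumption: histogram series count as evidence of their base name.
		for _, suffix := range []string{"_bucket", "_sum", "_count"} {
			if base := strings.TrimSuffix(name, suffix); base != name {
				seen[base] = true
			}
		}
	}
	var missing []string
	for _, w := range want {
		if !seen[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Hypothetical sample; in practice this would be the saved curl output.
	sample := `# TYPE metal3_reconcile_error_total counter
metal3_reconcile_error_total 0
`
	want := []string{
		"metal3_host_error_total",
		"metal3_operation_power_change_total",
		"metal3_host_config_data_error_total",
		"metal3_operation_deprovision_duration_seconds",
	}
	for _, name := range missingMetrics(sample, want) {
		fmt.Println("missing:", name)
	}
}
```

Running the same check against the "before power off" and "after power off" attachments would show which of the four names appear only after the power cycle.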
Created attachment 1675777 [details] before power off
These are all metrics that include labels. Since there's no way to know in advance which label values will show up, I guess the Prometheus client can't create anything until they start appearing. In effect, every combination of labels is a separate metric that gets created on the fly when needed. However, I can see that metal3_host_config_data_error_total isn't correctly registered in the code. I've posted a fix in https://github.com/metal3-io/baremetal-operator/pull/532 but I don't think it will make any difference to this issue, since the other metrics mentioned are already registered.
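To illustrate the behaviour described above, here is a toy Go model of a labelled counter. It is not the Prometheus client and not the baremetal-operator code; the counterVec type, the "host" label name and the "worker-0" value are all hypothetical. It only shows why a registered, labelled metric produces no output until a label value is first used.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// counterVec is a toy stand-in for a labelled counter family.
// It is NOT the Prometheus client; it only mimics the behaviour in question.
type counterVec struct {
	name   string
	label  string
	series map[string]float64 // one entry per label value seen so far
}

func newCounterVec(name, label string) *counterVec {
	return &counterVec{name: name, label: label, series: map[string]float64{}}
}

// inc creates the series for this label value on first use, then increments it.
func (c *counterVec) inc(labelValue string) {
	c.series[labelValue]++
}

// expose renders one line per existing series, sorted by label value.
// While no label value has been used, it returns the empty string.
func (c *counterVec) expose() string {
	values := make([]string, 0, len(c.series))
	for v := range c.series {
		values = append(values, v)
	}
	sort.Strings(values)
	var b strings.Builder
	for _, v := range values {
		fmt.Fprintf(&b, "%s{%s=%q} %g\n", c.name, c.label, v, c.series[v])
	}
	return b.String()
}

func main() {
	// The family exists from startup, but it has no series yet.
	powerChanges := newCounterVec("metal3_operation_power_change_total", "host")
	fmt.Printf("before power change: %q\n", powerChanges.expose())

	// The first power change for a host creates that host's series.
	powerChanges.inc("worker-0")
	fmt.Printf("after power change: %q\n", powerChanges.expose())
}
```

In this model, the only way to make a series visible from the start would be to touch each expected label value at startup, which is possible only when the label values are known in advance.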
*** Bug 1868411 has been marked as a duplicate of this bug. ***