Bug 1820204

Summary: Several metal3 metrics don't appear on the fresh cluster, but only after the first event they measure
Product: OpenShift Container Platform Reporter: Sasha Smolyak <ssmolyak>
Component: Bare Metal Hardware Provisioning Assignee: Zane Bitter <zbitter>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Amit Ugol <augol>
Status: CLOSED NOTABUG Docs Contact:
Severity: medium    
Priority: medium CC: aos-bugs, beth.white, dmaizel, zbitter
Version: 4.4 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-19 19:16:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description        Flags
after power off    none
before power off   none

Description Sasha Smolyak 2020-04-02 13:59:39 UTC
Created attachment 1675776 [details]
after power off

Description of problem:
When checking the metal3 metrics on the fresh cluster, these metrics don't appear in the log:
metal3_host_error_total
metal3_operation_power_change_total
metal3_host_config_data_error_total
metal3_operation_deprovision_duration_seconds

After powering one of the workers off and back on, I observed the metrics appear for the first time. Logs from before and after are attached.

Version-Release number of selected component (if applicable):
Openshift: 4.4.0-0.nightly-2020-04-01-005209

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and observe the metal3 metrics from the CLI: curl http://localhost:8085/metrics
2. A number of metal3 metrics are exposed by the metal3-baremetal-operator container in the metal3 pod:
metal3_reconcile_error_total
metal3_credentials_missing_total
metal3_credentials_invalid_total
metal3_credentials_unhandled_error_total
metal3_credentials_updated_total
metal3_credentials_no_management_access_total
metal3_operation_register_duration_seconds
metal3_operation_inspect_duration_seconds
metal3_operation_provision_duration_seconds
metal3_provisioning_state_change_total
metal3_host_registration_required_total
metal3_delete_without_deprovisioning_total

3. According to the source code, these are also supposed to appear:
metal3_host_error_total
metal3_operation_power_change_total
metal3_host_config_data_error_total
metal3_operation_deprovision_duration_seconds

4. In the UI, power one of the workers off and back on, then deprovision it.

Actual results:
The metrics metal3_operation_power_change_total and metal3_operation_deprovision_duration_seconds appear only after these operations.

Expected results:
The metrics appear from the beginning

Additional info:
logs attached
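
The check described in the steps above can be sketched in a few lines of Python. This is only an illustration, not a tool used in this report: it parses Prometheus text-format output (such as the saved output of the curl command in step 1) and lists which of the four expected metrics have no samples yet. The function names, the SAMPLE text, and the label names inside it are made up for the example.

```python
import re

# The four metrics this report expects to see on a fresh cluster.
EXPECTED = [
    "metal3_host_error_total",
    "metal3_operation_power_change_total",
    "metal3_host_config_data_error_total",
    "metal3_operation_deprovision_duration_seconds",
]

# Histogram and summary samples append these suffixes to the base name.
SUFFIXES = ("", "_bucket", "_sum", "_count")


def sample_names(exposition_text):
    """Return the set of sample names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the output."""
    present = sample_names(exposition_text)
    return [m for m in expected if not any(m + s in present for s in SUFFIXES)]


# Hypothetical output; the label names and values are invented for this example.
SAMPLE = """\
# TYPE metal3_reconcile_error_total counter
metal3_reconcile_error_total 0
metal3_operation_power_change_total{host="worker-0",power_on="false"} 1
"""

print(missing_metrics(SAMPLE))
# ['metal3_host_error_total', 'metal3_host_config_data_error_total',
#  'metal3_operation_deprovision_duration_seconds']
```

To run it against a live cluster, the text could be fetched from http://localhost:8085/metrics (for example with urllib.request.urlopen) and passed to missing_metrics.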

Comment 1 Sasha Smolyak 2020-04-02 14:00:14 UTC
Created attachment 1675777 [details]
before power off

Comment 2 Zane Bitter 2020-05-19 19:16:04 UTC
These are all metrics that include labels. Since there's no way to know in advance which labels will show up, I guess the Prometheus client can't create anything until they start appearing. In effect every combination of labels is a separate metric that gets created on the fly when needed.

However, I can see that metal3_host_config_data_error_total isn't correctly registered in the code. I've posted a fix in https://github.com/metal3-io/baremetal-operator/pull/532 but I don't think it will make any difference to this issue, since the other metrics mentioned are already registered.
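
The label behaviour described in this comment can be illustrated with a toy model. The sketch below is not the Prometheus client and not the baremetal-operator code; ToyCounterVec is a hypothetical class, and the label names "host" and "power_on" are assumptions made for the example. It only shows why a labelled metric has nothing to expose until the first label combination has been observed.

```python
class ToyCounterVec:
    """Toy stand-in for a labelled counter; not the real Prometheus client."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        # One entry per label-value combination seen so far.
        self._children = {}

    def inc(self, **labels):
        """Count one event, creating the child series on first use."""
        key = tuple(labels[n] for n in self.label_names)
        self._children[key] = self._children.get(key, 0) + 1

    def expose(self):
        """Return one text-format sample line per known label combination."""
        lines = []
        for key, value in sorted(self._children.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


# Label names are hypothetical, chosen only for illustration.
power = ToyCounterVec("metal3_operation_power_change_total", ["host", "power_on"])

print(power.expose())
# []  (registered, but no label combination exists yet)

power.inc(host="worker-0", power_on="false")
print(power.expose())
# ['metal3_operation_power_change_total{host="worker-0",power_on="false"} 1']
```

In this model the metric is known from the start, but each label combination is a separate series that comes into existence only when the first matching event is counted.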

Comment 3 Doug Hellmann 2020-09-22 16:34:35 UTC
*** Bug 1868411 has been marked as a duplicate of this bug. ***