Bug 1820204 - Several metal3 metrics don't appear on the fresh cluster, but only after the first event they measure
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Zane Bitter
QA Contact: Amit Ugol
URL:
Whiteboard:
Duplicates: 1868411
Depends On:
Blocks:
Reported: 2020-04-02 13:59 UTC by Sasha Smolyak
Modified: 2020-09-22 16:34 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-19 19:16:04 UTC
Target Upstream Version:
Embargoed:


Attachments
after power off (37.57 KB, text/plain)
2020-04-02 13:59 UTC, Sasha Smolyak
before power off (35.69 KB, text/plain)
2020-04-02 14:00 UTC, Sasha Smolyak

Description Sasha Smolyak 2020-04-02 13:59:39 UTC
Created attachment 1675776 [details]
after power off

Description of problem:
When checking the metal3 metrics on the fresh cluster, these metrics don't appear in the log:
metal3_host_error_total
metal3_operation_power_change_total
metal3_host_config_data_error_total
metal3_operation_deprovision_duration_seconds

After powering one of the workers off and back on, the metrics appeared for the first time. Logs from before and after are attached.

Version-Release number of selected component (if applicable):
OpenShift: 4.4.0-0.nightly-2020-04-01-005209

How reproducible:
100%

Steps to Reproduce:
1. Deploy a cluster and observe the metal3 metrics from the CLI: curl http://localhost:8085/metrics
2. A number of metal3 metrics are found in the metal3 pod, metal3-baremetal-operator:
metal3_reconcile_error_total
metal3_credentials_missing_total
metal3_credentials_invalid_total
metal3_credentials_unhandled_error_total
metal3_credentials_updated_total
metal3_credentials_no_management_access_total
metal3_operation_register_duration_seconds
metal3_operation_inspect_duration_seconds
metal3_operation_provision_duration_seconds
metal3_provisioning_state_change_total
metal3_host_registration_required_total
metal3_delete_without_deprovisioning_total

3. According to the source code, these are also supposed to appear:
metal3_host_error_total
metal3_operation_power_change_total
metal3_host_config_data_error_total
metal3_operation_deprovision_duration_seconds

4. In the UI, turn one of the workers off and back on. Then deprovision it.
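
The check in the steps above can be scripted instead of reading the curl output by eye. What follows is only a rough sketch: it assumes the endpoint serves the standard Prometheus text exposition format, and the helper names (fetch_metrics, metric_names, missing_metrics) are made up for this illustration and are not part of metal3 or any library.

```python
import urllib.request

# Metric names the report expects to see (taken from the list above).
EXPECTED = [
    "metal3_host_error_total",
    "metal3_operation_power_change_total",
    "metal3_host_config_data_error_total",
    "metal3_operation_deprovision_duration_seconds",
]


def fetch_metrics(url="http://localhost:8085/metrics"):
    """Download the raw metrics text from the endpoint used in step 1."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def base_name(sample_name):
    """Strip histogram/summary suffixes so *_bucket, *_sum, *_count map to one name."""
    for suffix in ("_bucket", "_sum", "_count"):
        if sample_name.endswith(suffix):
            return sample_name[: -len(suffix)]
    return sample_name


def metric_names(exposition_text):
    """Collect the names of all samples, skipping comments and blank lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="value"} 1   or   name 1
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
        names.add(base_name(name))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the text."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]
```

Running print(missing_metrics(fetch_metrics())) before and after step 4 should show the list shrinking, if the behaviour is as described in this report.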

Actual results:
The metrics metal3_operation_power_change_total and metal3_operation_deprovision_duration_seconds appear only after these actions.

Expected results:
The metrics appear from the beginning

Additional info:
logs attached

Comment 1 Sasha Smolyak 2020-04-02 14:00:14 UTC
Created attachment 1675777 [details]
before power off

Comment 2 Zane Bitter 2020-05-19 19:16:04 UTC
These are all metrics that include labels. Since there's no way to know in advance which labels will show up, I guess the Prometheus client can't create anything until they start appearing. In effect every combination of labels is a separate metric that gets created on the fly when needed.

However, I can see that metal3_host_config_data_error_total isn't correctly registered in the code. I've posted a fix in https://github.com/metal3-io/baremetal-operator/pull/532 but I don't think it will make any difference to this issue, since the other metrics mentioned are already registered.
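
To illustrate the point about labels, here is a toy model of a labelled counter. It is not the Prometheus client and not the baremetal-operator code, just a sketch of the behaviour described in the comment above: each label combination is its own series, created the first time it is incremented, so nothing is exposed before the first event. The class name and the label names (host, power_on) are assumptions chosen for the example.

```python
class LabeledCounter:
    """Toy labelled counter: series exist only once a label combination is used."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._children = {}  # tuple of label values -> current count

    def inc(self, *label_values, amount=1):
        """Increment the series for these label values, creating it if needed."""
        if len(label_values) != len(self.label_names):
            raise ValueError("expected %d label values" % len(self.label_names))
        self._children[label_values] = self._children.get(label_values, 0) + amount

    def expose(self):
        """Render one text line per existing series; no series means no lines."""
        lines = []
        for values, count in sorted(self._children.items()):
            labels = ",".join(
                '%s="%s"' % (key, value)
                for key, value in zip(self.label_names, values)
            )
            lines.append("%s{%s} %d" % (self.name, labels, count))
        return lines


# Hypothetical label names, for illustration only.
power_changes = LabeledCounter(
    "metal3_operation_power_change_total", ["host", "power_on"]
)

before = power_changes.expose()          # fresh cluster: nothing to show yet
power_changes.inc("worker-0", "false")   # first power-off event
after = power_changes.expose()

print(before)
print(after)
```

In this model the metric is "registered" from the start, yet the output stays empty until the first inc() call, which matches what the attached logs show for the power-change and deprovision metrics.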

Comment 3 Doug Hellmann 2020-09-22 16:34:35 UTC
*** Bug 1868411 has been marked as a duplicate of this bug. ***

